International Electronics and Electrical Engineering Technology Series

Digital Video Processing
(English Edition, Second Edition)

A. Murat Tekalp (Turkey)



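For many years, Digital Video Processing has been an authoritative guide for countless engineering students and professionals studying digital image and video processing in depth. In this second edition, the author examines major developments in image processing, computer vision, and video compression, and introduces new applications such as digital cinema, ultra-high-resolution video, and 3D video.

The book is detailed, balanced in organization, and rigorous in exposition. It covers image filtering, motion estimation, tracking, segmentation, video filtering, and compression. The exercises in every chapter have been updated and new MATLAB projects have been added, which makes this effectively a new textbook.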


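Contents include:

• Multi-dimensional signals and systems: transforms, sampling, format conversion.
• Digital images and video: human vision, digital video, video quality.
• Image filtering: gradient estimation, edge detection, scaling, multi-resolution representations, enhancement, denoising, restoration.
• Motion estimation: image formation, motion models, differential, matching, optimization, and transform-domain methods, 3D motion and shape estimation.
• Video segmentation and tracking: color and motion segmentation, change detection, shot boundary detection, video matting, video tracking and performance evaluation.
• Video filtering: motion-compensated filtering, multi-frame standards conversion, multi-frame noise filtering, restoration, super-resolution reconstruction.
• Image compression: JPEG, wavelets, JPEG 2000.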

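• Video compression: early standards, ITU-T H.264/MPEG-4 AVC, HEVC, scalable video compression, stereo and multi-view approaches.

A Note from the Publisher

Since the Renaissance, a long-standing scientific spirit and gradually established academic norms have given Western countries a dominant advantage in every field of natural science. The same tradition has allowed the United States to produce generation after generation of leading figures over more than sixty years of information technology development. As commercialization advanced, American industry and education became ever more closely linked, and many of the foremost authorities in the information disciplines work at the front line of both research and teaching. The classic scientific works that result not only mark out the scope of research but also reveal the evolution of scholarship. They follow academic norms while retaining each scholar's individuality, and their value does not fade with the passing years.

In recent years, driven by the global tide of informatization, China's information industry has grown rapidly and its demand for professionals has become increasingly urgent. This is both an opportunity and a challenge for China's education and publishing communities, and the development of professional textbooks carries great weight in educational strategy. Because information technology has a comparatively short history in China, the classic textbooks accumulated over decades in the United States and other developed countries still offer much worth learning from. Introducing a selection of outstanding foreign textbooks will therefore actively advance education in China, and it is a necessary step toward aligning with international practice and building truly world-class universities.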

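Huazhang, a company of China Machine Press, recognized early that "publishing should serve education." Since 1998 we have focused on selecting and translating outstanding foreign textbooks. Through years of sustained effort we have built good working relationships with world-renowned publishers including Pearson, McGraw-Hill, Elsevier, John Wiley & Sons, CRC, and Springer. From their hundreds of existing textbooks we have selected classic works by masters such as Thomas L. Floyd, Charles K. Alexander, Behzad Razavi, John G. Proakis, Stephen Brown, Allan R. Hambley, Albert Malvino, Mark I. Montrose, David A. Johns, Peter Wilson, H. Vincent Poor, Dikshitulu K. Kalluri, Bhag Singh Guru, and Stephane Mallat, and published them under the series title "International Electronics and Electrical Engineering Technology Series" for readers to study, research, and collect. These books have earned a good reputation among readers and have been adopted by many universities as official textbooks and reference books. Their reprint editions, the companion "Classic Original Edition Library," are likewise being adopted by a growing number of schools that teach bilingually.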

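Authoritative authors, classic textbooks, first-rate translators, rigorous review, and careful editing together guarantee the quality of our books. As the electrical and electronic information disciplines continue to mature and textbook reform deepens, the education community's demand for and use of foreign electronics and electrical engineering textbooks will enter a new stage. Our goal is perfection, and feedback is an important help in reaching it. Huazhang welcomes suggestions and corrections from teachers and readers. Our contact details are as follows:

Huazhang website: www.hzbook.com

E-mail: hzit@hzbook.com

Telephone: (010) 88379604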

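Address: 1 Baiwanzhuang South Street, Xicheng District, Beijing
Postal code: 100037
Huazhang Science and Technology Book Publishing Center

The first edition of this book, published in 1995, was a comprehensive textbook on digital video processing. It was organized into 25 chapters on the important topics of the field, each of which could be covered in one or two lectures of a one-semester course. Digital video technologies and video processing algorithms were not yet mature at that time: digital camcorders and DVD had only just been commercialized, digital TV standards were still being drafted, and digital cinema was not yet on the agenda. Consequently, some methods, algorithms, and technologies described in the first edition are outdated by today's standards (for example, pel-recursive motion estimation, vector quantization, fractal compression, and model-based coding), and others, such as analog video/TV and the 128K videophone, are obsolete. The first edition also cannot reflect the major advances made in the field over the past 20 years.

More than 20 years have passed since the first edition. In today's digital age, digital video is widely used in everyday life. Major advances in signal processing and computer vision have brought video processing algorithms to maturity, and it has become clearer which algorithms and techniques are the most common and effective for different purposes. This is therefore the right time for a new edition. The book has been carefully organized around the latest developments in image and video processing and aims to be a comprehensive, rigorously structured textbook.

The second edition substantially improves the organization of content and presentation and includes state-of-the-art techniques, the most effective algorithms, and the latest knowledge. It has eight chapters, each devoted to one topic: multi-dimensional signals and systems, digital images and video, image filtering, motion estimation, video segmentation and tracking, video filtering, image compression, and video compression. Each chapter emphasizes the most effective techniques. Compared with the first edition, this edition is not a simple extension of content but a complete rewrite.

The book can serve as a textbook for a senior undergraduate or graduate course on digital image and video processing. Readers are expected to know calculus, linear algebra, probability, and some basic digital signal processing concepts. Readers with a computer science background who are unfamiliar with basic signal processing concepts may skip Chapter 1 and start with Chapter 2. Although the presentation is rigorous, it starts from first principles, as a textbook should, so it can also be used as a self-study reference by engineers and researchers in industry or academia. The book helps readers understand the theoretical foundations of image and video processing methods and learn the most common and effective algorithms for typical image and video processing problems. The problem sets and MATLAB projects at the end of each chapter reinforce that understanding.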

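Digital video processing refers to all manipulations of the digital video bitstream. Every digital video application relies on compression. In addition, to obtain high-quality images or extract specific information, digital video applications rely on filtering, which is widely used for format conversion, enhancement, restoration, and super-resolution reconstruction. Some applications require further processing for motion estimation, video segmentation, and 3D scene analysis. What distinguishes video processing from still-image processing is the large amount of temporal correlation (redundancy) between frames. One can treat video as a sequence of still images and process each frame independently, but multi-frame processing techniques that exploit inter-frame correlation allow more effective algorithms, such as motion-compensated filtering and prediction. Moreover, some tasks, such as motion estimation and dynamic scene analysis, obviously cannot be performed on a single image.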

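The goal of this book is to give readers the mathematical foundations of image (single-frame) and video (multi-frame) processing methods. In particular, it answers the following fundamental questions:

• How can we separate images (signals) from noise?
• Are interpolation, restoration, and super-resolution reconstruction inherently related?
• How should 2D and 3D motion be estimated for different applications?
• How can images and video be segmented into regions of interest?
• How can objects be tracked in video?
• Is video filtering a better-posed problem than image filtering?
• Why is super-resolution reconstruction possible?
• Can a high-quality still image be obtained from a video clip?
• Why is image and video compression possible?
• How are images and video compressed?
• What are the latest international standards for image/video compression?
• What are the latest standards for 3D video representation and compression?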

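Most image and video processing problems are ill-posed (underdetermined and/or sensitive to noise), and their solutions depend on some image and video model. Appendix A discusses image modeling approaches for solving ill-posed problems. In practice, image models can be classified as those based on local smoothness, on sparseness in a transform domain, and on non-local self-similarity.

Most image processing algorithms use one or more of these models. Video models additionally include those based on global translation or block motion, parametric motion, (spatial) smoothness of motion, uniformity of motion over time (temporal continuity or smoothness), and planar support in the 3D spatio-temporal frequency domain.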

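The chapters are summarized below.

Chapter 1 reviews the basics of multi-dimensional signals, transforms, and systems, which form the theoretical basis of many image and video processing methods. It also covers spatio-temporal sampling structures (such as progressive and interlaced sampling) and the theory of sampling structure conversion. Readers with a computer science background who are unfamiliar with signal processing concepts may skip this chapter and start with Chapter 2.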

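Chapter 2 presents the basics of digital images and video. It covers fundamental concepts including human vision, spatial frequency, color models, analog and multi-view video representations, and digital video quality assessment, along with common digital video applications such as digital TV, digital cinema, and Internet video streaming.

Chapter 3 addresses image (still-frame) filtering problems such as image resampling (decimation and interpolation), gradient estimation and edge detection, enhancement, denoising, and restoration. It covers linear shift-invariant, adaptive, and nonlinear filters. A general framework for solving ill-posed inverse problems is given in Appendix A.

Chapter 4 covers 2D and 3D motion estimation methods. Motion estimation is at the heart of digital video processing, because motion is a prominent feature of video and motion-compensated filtering is the most effective way to exploit temporal redundancy. Furthermore, the first step of many computer vision tasks is 2D or 3D motion estimation and tracking. 2D motion estimation is generally either dense optical flow estimation or sparse feature correspondence estimation, and it can be performed with parametric or nonparametric methods. Nonparametric methods include image-gradient-based optical flow estimation, block matching, pel-recursive methods, Bayesian methods, and phase correlation. Parametric methods based on the affine model or a homography can be used for image registration or for estimating local deformations. 3D motion/structure estimation methods are generally based on the two-frame epipolar constraint (mainly for stereo pairs) or on multi-frame factorization. Euclidean 3D structure reconstruction requires calibrating all cameras, whereas projective reconstruction can be done without calibration.

Chapter 5 covers image segmentation and change detection, as well as dominant-motion and multiple-motion segmentation using parametric clustering and Bayesian methods. It also discusses simultaneous motion estimation and segmentation. Two-view motion estimation techniques are very sensitive to the accuracy of image gradient or point correspondence estimates, whereas tracking the motion of segmented objects over long single-view sequences gives more robust results, so motion tracking is discussed as well.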

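Chapter 6 covers video filtering, including standards conversion, denoising, and super-resolution reconstruction. It first introduces the basic principles of motion-compensated filtering and then turns to standards conversion problems, including frame-rate conversion and de-interlacing. Video frames often suffer from graininess, which is especially noticeable when viewed in freeze-frame mode, so motion-adaptive and motion-compensated filtering for noise suppression are discussed. Finally, a comprehensive model for low-resolution video acquisition and super-resolution reconstruction is presented that unifies the various video filtering problems.

Chapter 7 covers still-image compression methods and standards for binary (fax) and gray-scale images, such as JPEG and JPEG 2000. Lossy discrete cosine transform coding and wavelet transform coding are discussed in particular.

Chapter 8 discusses video compression methods and standards, which are the foundation of digital video applications such as digital TV and digital cinema. After a brief introduction to different approaches to video compression, it describes the MPEG-2, AVC/H.264, and HEVC standards in detail, together with their extensions for scalable video coding and stereo/multi-view video coding.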

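This textbook is the outcome of more than 20 years of my teaching in digital image and video processing. It is rich in content and rigorously organized, covering the fundamental principles and latest advances in image filtering, motion estimation and tracking, image/video segmentation, video filtering, and image/video compression. No single textbook can cover all of the state of the art in digital video processing and computer vision, however, so this book explains only the most fundamental and widely used techniques and algorithms in detail, while more advanced algorithms and recent research results are introduced briefly with references for self-study. Each chapter ends with a problem set and MATLAB projects so that readers can practice the methods they have learned.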

Instructors may request teaching materials. Depending on each school's schedule, the entire book can be covered in a one-semester course on digital image and video processing. Alternatively, the material can be split across two semesters, which leaves more time for the details of each topic: a first-semester course on digital image processing covering Chapters 1 to 3 and Chapter 7, followed by a second-semester course on digital video processing covering Chapters 4 to 6 and Chapter 8.

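Clearly, this book is a compilation of research results from the signal processing and computer science communities. Every chapter contains many citations and lists the relevant references, but these certainly cannot cover all the achievements of outstanding researchers in academia and industry working on images and video. Likewise, it is impossible to acknowledge individually the remarkable image and video coding results that scientists in the ISO and ITU groups have achieved over many years of work.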

Finally, I sincerely thank Xin Li (West Virginia University, WVU), Eli Saber, Moncef Gabbouj, Janusz Konrad, and H. Joel Trussell for their contributions during the preparation of the manuscript, and Bernard Goodwin, Kim Boedigheimer, and Julie Nahil of Prentice Hall for their help and support.

A. Murat Tekalp
Koç University, Istanbul, Turkey

Chapter 1 Multi-Dimensional Signals and Systems

1.1 Multi-Dimensional Signals

1.1.1 Finite-Extent Signals and Periodic Signals

1.1.2 Symmetric Signals

1.1.3 Special Multi-Dimensional Signals

1.2 Multi-Dimensional Transforms

1.2.1 Fourier Transform of Continuous Signals

CHAPTER 1


Multi-Dimensional Signals 
and Systems 





There are some fundamental differences between the theory of one-dimensional 
and multi-dimensional signal processing that arise from the related facts that 
zeros and poles of the multi-dimensional z-transform are not isolated points 
but functions, and multi-variable polynomials in general do not factor. 


The theory of multi-dimensional (MD) signals and systems is fundamental to under- 
standing digital image and video processing. Digital images are two-dimensional 
(2D) sequences (partially ordered) with two discrete spatial variables; they can also 
be represented by arrays (vectors or matrices). Digital video is a three-dimensional 
(3D) function of two spatial and one temporal variable, and 3D video is a four- 
dimensional (4D) function of three spatial and one temporal variable. Digital filters 
to process these signals are MD systems. MD transforms help to understand spatial 
and temporal frequency concepts, and physical and normalized frequency variables. 
This chapter introduces the fundamental concepts of MD signals, transforms, and 
systems, as well as sampling MD analog signals on lattices and sampling structure 
conversion, which are essential for digital image and video processing. 

There are some fundamental differences between the theory of 1D and MD sig-
nal processing, which arise from: i) zeros and poles of the MD z-transform are not
isolated points but functions, and ii) multi-variable polynomials in general do not
factor. As a result, some 1D signal-processing methods do not readily generalize to
multi dimensions, but at the same time, it is possible to develop new algorithms in 
multi dimensions that do not have 1D counterparts. In the following, we present 
the main results for MD signals, systems, and transforms in 2D notation, since it is 
easier to visualize in 2D. Most definitions and results for 2D signal processing can 
be readily generalized to more than two dimensions, although the notation becomes 
more cumbersome and abstract and visualization may not be possible. 


1.1 Multi-Dimensional Signals 


An MD signal is a multi-variable function or sequence of $M \ge 2$ independent
continuous, discrete, or mixed variables.


Definition: An analog MD signal $s(\mathbf{x}) = s(x_1, x_2, \ldots, x_M)$ is a function of
$M$ continuous variables, where $\mathbf{x} = [x_1\ x_2\ \ldots\ x_M]^T$. The function $s(\mathbf{x})$ may be
scalar valued (e.g., gray-scale image) or vector valued (e.g., color image).


Definition: A discrete MD signal $s(\mathbf{n}) = s(n_1, n_2, \ldots, n_M)$ is an MD
sequence defined over a lattice, which is a partially ordered set of tuples of
integers $\mathbf{n} = [n_1\ n_2\ \ldots\ n_M]^T$. Again, $s(\mathbf{n})$ may be scalar or vector valued.


Definition: A mixed MD signal is a function of M variables some of 
which are continuous, while others are discrete. An analog-video signal is a 
mixed signal. 


1.1.1 Finite-Extent Signals and Periodic Signals 


This section discusses MD finite-extent signals and MD periodic signals that are 
isomorphic to each other. 


Finite-Extent Signals 


Since cameras have finite-size sensors and video is recorded over a finite duration, 
most MD signals of interest are finite extent; i.e., they are defined only within a finite 
region in space/time, called the “support.” The support of a signal may have an MD 
rectangular or arbitrary shape. 

A 2D signal is said to have wedge support if it is defined only within two lines 
emanating from the origin, as shown in Figure 1.1(d). Quarter-plane (Figure 1.1(a)), 
half-plane (Figure 1.1(b)), and non-symmetric half-plane (NSHP) (Figure 1.1(c)) 


supports are special cases of wedge support.


Figure 1.1 Support of 2D signals: (a) quarter-plane support; (b) half-plane support; 
(c) non-symmetric half-plane support; and (d) wedge support. 


Example. Digital images have finite quarter-plane support, which
is depicted in Figure 1.1(a), i.e., they are defined within a rectangle
$0 \le n_1 \le N_1 - 1$ and $0 \le n_2 \le N_2 - 1$, where $N_1$ and $N_2$ denote the
horizontal and vertical size of an image in pixels.


Periodic Signals 


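An MD sequence $s(\mathbf{n})$, where $\mathbf{n} = [n_1\ n_2\ \ldots\ n_M]^T$, is said to be periodic with the
periodicity matrix $\mathbf{N}$ if it satisfies

$$s(\mathbf{n}) = s(\mathbf{n} + \mathbf{N}\mathbf{r}) \tag{1.1a}$$

for all integer-valued $M$-vectors $\mathbf{r}$, where $\mathbf{N}$ is an $M \times M$ periodicity matrix, such
that $\det(\mathbf{N}) \ne 0$.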


For 2D signals, $\mathbf{r} = \begin{bmatrix} r_1 \\ r_2 \end{bmatrix}$ and $\mathbf{N} = [\mathbf{n}_1 \,|\, \mathbf{n}_2] = \begin{bmatrix} N_{11} & N_{12} \\ N_{21} & N_{22} \end{bmatrix}$, and the vector equation (1.1a) can be expressed in scalar form as

$$\begin{aligned} s(n_1, n_2) &= s(n_1 + r_1 N_{11},\ n_2 + r_1 N_{21}) && \text{with } r_2 = 0 \\ &= s(n_1 + r_2 N_{12},\ n_2 + r_2 N_{22}) && \text{with } r_1 = 0 \end{aligned} \tag{1.1b}$$

The integer-valued vectors $\mathbf{n}_1 = (N_{11}, N_{21})$ and $\mathbf{n}_2 = (N_{12}, N_{22})$ denote displacements
from any sample in one period to the corresponding samples in two other
periods. In other words, the 2D signal repeats itself at all integer multiples of the
shift vectors $\mathbf{n}_1$ and $\mathbf{n}_2$, as depicted in Figure 1.2. The period defines a unit cell in the
$(n_1, n_2)$ plane, which repeats over the whole plane. We note that $\mathbf{N}$ is not unique;
two arbitrarily chosen linearly independent vectors $\mathbf{n}_1$ and $\mathbf{n}_2$ that point to the same
sample in two different periods can be used to represent the same periodicity pattern.

A special case arises when the periodicity matrix is diagonal, given by

$$\mathbf{N} = \begin{bmatrix} N_1 & 0 \\ 0 & N_2 \end{bmatrix}$$

which is illustrated in Figure 1.2(b). In this case, the shift vectors
$\mathbf{n}_1$ and $\mathbf{n}_2$ lie along the horizontal and vertical axes. Then, $s(n_1, n_2)$ is said to be
rectangularly periodic and satisfies

$$s(n_1, n_2) = s(n_1 + N_1, n_2) = s(n_1, n_2 + N_2) \quad \text{for all } (n_1, n_2)$$

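Example. A 2D discrete complex exponential signal

$$s(n_1, n_2) = e^{j(\omega_1 n_1 + \omega_2 n_2)}$$

is rectangularly periodic in $(n_1, n_2)$ with period $(N_1, N_2)$ if

$$\omega_1 = \frac{2\pi k_1}{N_1} \quad \text{and} \quad \omega_2 = \frac{2\pi k_2}{N_2}$$

where $N_1$, $N_2$, $k_1$, and $k_2$ are unitless integers, and the units of $\omega_1$ and $\omega_2$ are
radians.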
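The periodicity condition in this example can be checked numerically. The sketch below is only an illustration: the period (8, 5), the integers k1 = 3 and k2 = 2, and the helper names are assumptions chosen for the demonstration, not values from the text.

```python
import cmath
import math

def complex_exponential(n1, n2, w1, w2):
    """Sample s(n1, n2) = exp(j(w1*n1 + w2*n2))."""
    return cmath.exp(1j * (w1 * n1 + w2 * n2))

def max_period_error(w1, w2, N1, N2, span=10):
    """Largest deviation from s(n1,n2) = s(n1+N1,n2) = s(n1,n2+N2) on a test grid."""
    err = 0.0
    for n1 in range(-span, span):
        for n2 in range(-span, span):
            s = complex_exponential(n1, n2, w1, w2)
            err = max(err,
                      abs(s - complex_exponential(n1 + N1, n2, w1, w2)),
                      abs(s - complex_exponential(n1, n2 + N2, w1, w2)))
    return err

# Hypothetical example values: period (N1, N2) = (8, 5), integers k1 = 3, k2 = 2.
N1, N2, k1, k2 = 8, 5, 3, 2
w1 = 2 * math.pi * k1 / N1
w2 = 2 * math.pi * k2 / N2

periodic_error = max_period_error(w1, w2, N1, N2)        # close to zero
nonperiodic_error = max_period_error(1.0, w2, N1, N2)    # w1 = 1 rad violates the condition
```

When the horizontal frequency is not an integer multiple of 2π/N1, the shifted samples no longer coincide and the reported error is of order one.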





Figure 1.2 Periodicity in 2D: (a) general periodicity defined by vectors $\mathbf{n}_1$ and $\mathbf{n}_2$
and (b) rectangular periodicity.

Periodic signals and finite-extent signals are isomorphic to one another, since
given a rectangularly periodic signal, the main period of the periodic signal can be
defined as a finite-extent signal, and given a finite-extent signal we can define a
periodic signal through periodic extension

$$\tilde{s}(n_1, n_2) = \sum_{k_1=-\infty}^{\infty} \sum_{k_2=-\infty}^{\infty} s(n_1 - k_1 N_1,\ n_2 - k_2 N_2) \tag{1.2}$$
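As a minimal sketch of the periodic extension in Eqn. (1.2), the code below sums shifted copies of a small finite-extent signal. The 2 x 3 test signal, the truncation bound K, and the function names are assumptions made for illustration; the infinite sums are truncated, which is harmless here because only one shifted copy is nonzero at any sample.

```python
def finite_extent(s, n1, n2):
    """Value of the finite-extent signal s[n1][n2], zero outside its support."""
    N1, N2 = len(s), len(s[0])
    return s[n1][n2] if 0 <= n1 < N1 and 0 <= n2 < N2 else 0

def periodic_extension(s, n1, n2, K=5):
    """Sum of shifted copies s(n1 - k1*N1, n2 - k2*N2), truncated to |k1|, |k2| <= K."""
    N1, N2 = len(s), len(s[0])
    return sum(finite_extent(s, n1 - k1 * N1, n2 - k2 * N2)
               for k1 in range(-K, K + 1)
               for k2 in range(-K, K + 1))

s = [[1, 2, 3],
     [4, 5, 6]]          # hypothetical 2 x 3 signal, indexed s[n1][n2]

value = periodic_extension(s, 3, -1)   # equals s[3 % 2][(-1) % 3]
```

For samples within the truncated range, the extension reduces to indexing the main period modulo (N1, N2), which is how the isomorphism is usually exploited in practice.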


1.1.2 Symmetric Signals 

There are multiple forms of symmetry in two or more dimensions. Most common 
forms of 2D symmetry are two-fold, four-fold, or circular symmetry. Symmetric 
signals can be finite or infinite extent. In the following, we define symmetry with 
respect to origin. Symmetry with respect to an arbitrary point can be defined by an 
appropriate shift of the coordinates. 

Two-Fold Symmetry 

Two-fold rectangular symmetry is also called non-symmetric half-plane symmetry
and is given by

$$s(n_1, n_2) = s(-n_1, -n_2) \tag{1.3a}$$


The distinct coefficients are depicted by dots in Figure 1.1(c). The remaining 
coefficients to complete the square support are determined by the symmetry. 


Four-Fold Symmetry 

A more strict form of symmetry is the four-fold rectangular symmetry given by 
$$s(n_1, n_2) = s(-n_1, n_2) = s(n_1, -n_2) \tag{1.3b}$$

The support of the distinct coefficients in this case is a quarter-plane. 


Circular Symmetry 


A signal $s(n_1, n_2)$ is circularly symmetric if it is only a function of the distance $\sqrt{n_1^2 + n_2^2}$
from the origin. Circular symmetry implies four-fold symmetry.


1.1.3 Special Multi-Dimensional Signals 


Some special signals play an important role in understanding image- and video-
processing filters. They are separable signals, spatial-frequency patterns, MD impulse,
and MD unit step signals, which are introduced in the following.

Separable Signals

An MD signal (function) is separable if

$$s(n_1, n_2, \ldots, n_M) = s_1(n_1)\, s_2(n_2) \cdots s_M(n_M) \tag{1.4}$$

We can make the following observation in the case of 2D signals. A finite-support
2D signal $s(n_1, n_2)$ can be represented by a matrix $\mathbf{S}$. If the signal is separable, then
the matrix $\mathbf{S}$ can be written as the outer product $\mathbf{S} = \mathbf{s}_1 \mathbf{s}_2^T$, where the vectors $\mathbf{s}_1$ and $\mathbf{s}_2$
denote samples of 1D signals $s_1(n_1)$ and $s_2(n_2)$, respectively. We note that while a
general $N_1 \times N_2$ matrix has $N_1 N_2$ degrees of freedom, the outer product has $N_1 + N_2$
degrees of freedom.


Example. A 2D discrete complex exponential signal is separable, since

$$s(n_1, n_2) = e^{j(\omega_1 n_1 + \omega_2 n_2)} = e^{j\omega_1 n_1}\, e^{j\omega_2 n_2}$$
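The outer-product view of separability can be illustrated with a short sketch. The helper names and the test vectors are assumptions for the demonstration; the separability check relies on the fact that a matrix is an outer product of two vectors exactly when all of its 2 x 2 minors vanish.

```python
def outer(s1, s2):
    """Outer product S[n1][n2] = s1[n1] * s2[n2]."""
    return [[a * b for b in s2] for a in s1]

def is_separable(S, tol=1e-12):
    """True if every 2 x 2 minor of S vanishes, i.e. S has rank at most one."""
    rows, cols = len(S), len(S[0])
    return all(abs(S[i][j] * S[k][l] - S[i][l] * S[k][j]) <= tol
               for i in range(rows) for k in range(i + 1, rows)
               for j in range(cols) for l in range(j + 1, cols))

s1 = [1, 2, 3]            # hypothetical samples of s1(n1)
s2 = [4, 5]               # hypothetical samples of s2(n2)
S = outer(s1, s2)         # 3 x 2 separable signal

separable = is_separable(S)                      # True
not_separable = is_separable([[1, 0], [0, 1]])   # False: the identity has rank two
```

Storing `s1` and `s2` instead of `S` is what reduces the description from N1*N2 numbers to N1 + N2 numbers.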


Spatial-Frequency Patterns 


We can define the horizontal spatial-frequency pattern

$$s(n_1, n_2) = \cos(\omega_1 n_1)$$

which is the same for all rows (image lines), or the vertical spatial-frequency pattern

$$s(n_1, n_2) = \cos(\omega_2 n_2)$$


which is the same for all columns, or the spatial-frequency pattern with 45-degree 
angular orientation 


$$s(n_1, n_2) = \cos(\omega (n_1 + n_2))$$


MD Impulse 
An MD impulse is defined by

$$\delta(n_1, n_2, \ldots, n_M) = \begin{cases} 1 & n_1 = n_2 = \cdots = n_M = 0 \\ 0 & \text{otherwise} \end{cases} \tag{1.5a}$$

Figure 1.3 Two-dimensional signal representation in terms of 2D impulses.


A 2D impulse denotes a point source with unity amplitude on the plane. 
Any discrete image (2D signal) can be expressed as a sum of shifted and weighted 
impulses as 


$$s(n_1, n_2) = \sum_{k_1=-\infty}^{\infty} \sum_{k_2=-\infty}^{\infty} s(k_1, k_2)\, \delta(n_1 - k_1,\ n_2 - k_2) \tag{1.5b}$$


which is a generic 2D signal representation used in some derivations. This is illus- 
trated in Figure 1.3. 
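The representation in Eqn. (1.5b) can be sketched for a finite-extent signal, where the double sum only needs to run over the support. The test signal and function names below are hypothetical and chosen only to illustrate the sifting behavior of the impulse.

```python
def delta(n1, n2):
    """2D impulse: 1 at the origin, 0 elsewhere."""
    return 1 if (n1 == 0 and n2 == 0) else 0

def impulse_sum(s, n1, n2):
    """Evaluate sum over (k1, k2) of s[k1][k2] * delta(n1 - k1, n2 - k2) on the support of s."""
    return sum(s[k1][k2] * delta(n1 - k1, n2 - k2)
               for k1 in range(len(s))
               for k2 in range(len(s[0])))

s = [[3, -1, 4],
     [1, 5, 9]]           # hypothetical 2 x 3 signal, indexed s[n1][n2]

rebuilt = [[impulse_sum(s, n1, n2) for n2 in range(3)] for n1 in range(2)]
outside = impulse_sum(s, 5, 5)   # no impulse lands here, so the sum is zero
```

Each shifted impulse selects exactly one sample, so the weighted sum reproduces the signal inside the support and gives zero outside it.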


Example. The 2D impulse is separable, since we can write

$$\delta(n_1, n_2) = \delta(n_1)\, \delta(n_2)$$

where

$$\delta(n_i) = \begin{cases} 1 & n_i = 0 \\ 0 & \text{otherwise} \end{cases}$$

is the 1D Kronecker delta function.


MD Unit Step 


An MD unit step is often used to indicate the support of an MD signal or the 
impulse response of an MD system. It can be defined on any support. For example, 
referring to Figure 1.1, we can define a 2D unit step with quarter-plane, half-plane, 
or wedge support. 


Example. 2D unit step with quarter-plane support is defined as

$$u_{QP}(n_1, n_2) = \begin{cases} 1 & n_1 \ge 0,\ n_2 \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

2D unit step with half-plane support is defined as

$$u_{HP}(n_1, n_2) = \begin{cases} 1 & n_1 \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

The sample values are "1" on the dots shown in Figure 1.1(a) and (b), respectively,
and zero elsewhere.


1.2 Multi-Dimensional Transforms 


Signals, just like vectors, can be expanded onto some basis functions. The MD con- 
tinuous (discrete) Fourier transform expands MD continuous (discrete) signals onto 
MD continuous (discrete) complex exponentials, which form an orthogonal basis. 
Here, we only present 2D Fourier transform for notational simplicity; however, all 
definitions and properties of 2D Fourier transform generalize to MD Fourier trans- 
form. Our discussion of 2D wavelet transform is delayed until we introduce finite 
impulse response (FIR) image filtering and multi-resolution/multi-scale image rep- 
resentations in Chapter 3. 


1.2.1 Fourier Transform of Continuous Signals 


While digital images and video are functions of discrete variables (sequences), we 
start this section with the Fourier transform of functions of continuous variables to 
introduce the notation used in Section 1.4, since signals with discrete variables are 
obtained by sampling signals with continuous variables. 


Definition: The Fourier transform of a 2D continuous signal $s(x_1, x_2)$ is
given by

$$S(u_1, u_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} s(x_1, x_2)\, e^{-j(u_1 x_1 + u_2 x_2)}\, dx_1\, dx_2 \tag{1.6a}$$

and the inverse 2D Fourier transform is given by

$$s(x_1, x_2) = \frac{1}{4\pi^2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} S(u_1, u_2)\, e^{j(u_1 x_1 + u_2 x_2)}\, du_1\, du_2 \tag{1.6b}$$

where $u_1$ and $u_2$ denote real spatial-frequency variables with units cycles/
distance. We can also have a temporal variable, e.g., $s(x_1, x_2, t)$. Then, the
unit for temporal frequency is Hz.


The 2D Fourier transform is complex:

$$S(u_1, u_2) = |S(u_1, u_2)|\, e^{j\theta(u_1, u_2)} = S_R(u_1, u_2) + j S_I(u_1, u_2) \tag{1.6c}$$

where $S_R(u_1, u_2)$ and $S_I(u_1, u_2)$ denote the real and imaginary parts, respectively,
and $|S(u_1, u_2)|$ and $\theta(u_1, u_2)$ denote the Fourier magnitude and Fourier phase,
respectively.


Convergence of the Fourier Transform 


The 2D signal s(x,,x,) may have an infinite extent. Hence, we need to consider the 
conditions under which the double integral (1.6a) exists: 


• Uniform convergence: The Fourier transform exists and is a continuous
function of $u_1$ and $u_2$ (i.e., the double integral converges uniformly) if
$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} |s(x_1, x_2)|\, dx_1\, dx_2 < \infty$; i.e., $s(x_1, x_2)$ is absolutely integrable.

• Mean-square convergence: If $S(u_1, u_2)$ exists but has discontinuities, then a
weaker form of convergence, called mean-square convergence, applies. For
example, $s(x_1, x_2) = \dfrac{\sin x_1}{x_1} \dfrac{\sin x_2}{x_2}$ is not absolutely integrable, but its
Fourier transform converges in the mean-square sense. Then, we observe the
Gibbs effect around points of discontinuity.

• Generalized convergence: In this case, neither uniform nor mean-square
convergence applies, but $S(u_1, u_2)$ may still be defined using the Dirac delta
function $\delta(u_1, u_2)$. For example, $s(x_1, x_2) = 1$ for all $(x_1, x_2)$ is not absolutely
integrable, but its Fourier transform is defined in the generalized sense as

$$S(u_1, u_2) = \delta(u_1, u_2).$$


Properties of Multi-Dimensional Fourier Transform of Continuous Signals 


This section discusses some unique properties of the MD Fourier transform that do 
not have 1D counterparts. 


Decomposition into 1D Transforms


MD Fourier transform can be decomposed into a series of 1D Fourier transforms
over each variable separately, which is demonstrated here for the case of 2D. Eqn.
(1.6a) can be rewritten as

$$\begin{aligned} S(u_1, u_2) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} s(x_1, x_2)\, e^{-j(u_1 x_1 + u_2 x_2)}\, dx_1\, dx_2 \\ &= \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} s(x_1, x_2)\, e^{-j u_1 x_1}\, dx_1 \right] e^{-j u_2 x_2}\, dx_2 \\ &= \int_{-\infty}^{\infty} S(u_1, x_2)\, e^{-j u_2 x_2}\, dx_2 \end{aligned}$$

which shows that it can be evaluated as a 1D transform over $x_1$ followed by another
1D transform of the intermediate result $S(u_1, x_2)$ over $x_2$.


Affine Transformation of Coordinates 
Let

$$g(x_1, x_2) = s(a x_1 + b x_2 + c,\ d x_1 + e x_2 + f)$$

denote an affine transformation of coordinates. If $S(u_1, u_2)$ is the 2D Fourier
transform of $s(x_1, x_2)$, then the 2D Fourier transform of $g(x_1, x_2)$ is given by

$$G(u_1, u_2) = \frac{1}{|ae - bd|}\, e^{\,j \frac{(ec - bf) u_1 + (af - cd) u_2}{ae - bd}}\, S\!\left( \frac{e u_1 - d u_2}{ae - bd},\ \frac{-b u_1 + a u_2}{ae - bd} \right) \tag{1.7a}$$

Proof 


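The Fourier transform is given by

$$G(u_1, u_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} s(a x_1 + b x_2 + c,\ d x_1 + e x_2 + f)\, e^{-j(u_1 x_1 + u_2 x_2)}\, dx_1\, dx_2$$

In order to perform a change of variables, we can first write the affine mapping in
vector-matrix notation as

$$\begin{bmatrix} x_1' \\ x_2' \end{bmatrix} = \begin{bmatrix} a & b \\ d & e \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} c \\ f \end{bmatrix}$$

and the Jacobian relation as $dx_1'\, dx_2' = |\Delta|\, dx_1\, dx_2$, where $\Delta = ae - bd$. Provided that
$\Delta \ne 0$, the inverse relation is given by

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a & b \\ d & e \end{bmatrix}^{-1} \begin{bmatrix} x_1' - c \\ x_2' - f \end{bmatrix}$$

Then,

$$u_1 x_1 + u_2 x_2 = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} a & b \\ d & e \end{bmatrix}^{-1} \begin{bmatrix} x_1' \\ x_2' \end{bmatrix} - \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} a & b \\ d & e \end{bmatrix}^{-1} \begin{bmatrix} c \\ f \end{bmatrix}$$

and

$$\begin{aligned} G(u_1, u_2) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} s(x_1', x_2')\, e^{\,j \frac{(ec - bf) u_1 + (af - cd) u_2}{ae - bd}}\, e^{-j \frac{(e u_1 - d u_2) x_1' + (-b u_1 + a u_2) x_2'}{ae - bd}}\, \frac{dx_1'\, dx_2'}{|ae - bd|} \\ &= \frac{1}{|ae - bd|}\, e^{\,j \frac{(ec - bf) u_1 + (af - cd) u_2}{ae - bd}}\, S\!\left( \frac{e u_1 - d u_2}{ae - bd},\ \frac{-b u_1 + a u_2}{ae - bd} \right) \end{aligned}$$
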
Special Cases 
1. Translation: $a = e = 1$, $b = d = 0$.

$$G(u_1, u_2) = e^{j(c u_1 + f u_2)}\, S(u_1, u_2) \tag{1.7b}$$

which states that translation in the spatial coordinates results in a phase
shift in the Fourier domain, as expected.

2. Rotation: $c = f = 0$, $a = e = \cos\theta$, $d = -b = \sin\theta$, where $ae - bd = 1$.

$$G(u_1, u_2) = S(u_1 \cos\theta - u_2 \sin\theta,\ u_1 \sin\theta + u_2 \cos\theta) \tag{1.7c}$$

which states that a rotation of spatial coordinates results in a rotation of
Fourier coordinates by an equal angle. Hence, the Fourier transform is
rotation invariant.


Projection Slice Theorem 


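Let $s(x_1, x_2)$ have Fourier transform $S(u_1, u_2)$, and let $p_\theta(x_1)$ denote the Radon
transform of $s(x_1, x_2)$ defined by

$$p_\theta(x_1) = \int_{-\infty}^{\infty} s(x_1 \cos\theta - x_2 \sin\theta,\ x_1 \sin\theta + x_2 \cos\theta)\, dx_2 \tag{1.8a}$$

which projects $s(x_1, x_2)$ to a line through the origin with angle $\theta$. Then,

$$P_\theta(\Omega) = S(\Omega \cos\theta,\ \Omega \sin\theta) \tag{1.8b}$$

where $P_\theta(\Omega)$ denotes the 1D Fourier transform of $p_\theta(x_1)$ for each angle $\theta$.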

In words, the projection slice theorem says that taking a function $s(x_1, x_2)$ and
first projecting it to a line through the origin with angle $\theta$ and then taking its 1D
Fourier transform is equivalent to taking the 2D Fourier transform of $s(x_1, x_2)$ and
evaluating the Fourier transform $S(u_1, u_2)$ on the line through the origin with angle $\theta$.
This can be easily proven for the case $\theta = 0$. Projection of $s(x_1, x_2)$ on the horizontal
axis is defined by

$$p_0(x_1) = \int_{-\infty}^{\infty} s(x_1, x_2)\, dx_2$$

Taking the 1D Fourier transform of the projection $p_0(x_1)$ yields

$$\begin{aligned} P_0(u_1) &= \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} s(x_1, x_2)\, dx_2 \right] e^{-j u_1 x_1}\, dx_1 \\ &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} s(x_1, x_2)\, e^{-j(u_1 x_1 + 0 \cdot x_2)}\, dx_1\, dx_2 = S(u_1, 0) \end{aligned}$$

which provides the desired result. The projection-slice theorem is fundamental to
how several medical imaging modalities (e.g., computed tomography) work.
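A discrete analog of the $\theta = 0$ case can be checked numerically: summing a sampled image over $n_2$ and taking a 1D DFT gives the $k_2 = 0$ slice of its 2D DFT. This sketch is not the continuous theorem itself; it substitutes finite sums for the integrals, and the test image and function names are assumptions.

```python
import cmath
import math

def dft1(x):
    """Direct 1D DFT of a list of numbers."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def dft2(s):
    """Direct 2D DFT; s is indexed s[n1][n2], the result is indexed S[k1][k2]."""
    N1, N2 = len(s), len(s[0])
    return [[sum(s[n1][n2] * cmath.exp(-2j * math.pi * (k1 * n1 / N1 + k2 * n2 / N2))
                 for n1 in range(N1) for n2 in range(N2))
             for k2 in range(N2)] for k1 in range(N1)]

s = [[1, 2, 0, 3],
     [4, 1, 1, 0],
     [2, 2, 5, 1]]                        # hypothetical 3 x 4 image

projection = [sum(row) for row in s]      # p0(n1): sum over n2
P0 = dft1(projection)                     # 1D DFT of the projection
S = dft2(s)
slice_error = max(abs(P0[k1] - S[k1][0]) for k1 in range(len(s)))
```

The two results agree up to rounding error, mirroring the relation $P_0(u_1) = S(u_1, 0)$.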


1.2.2 Fourier Transform of Discrete Signals 


Many properties of the MD Fourier transform of continuous signals also hold for 
MD Fourier transform of discrete signals. However, there are also some important 
differences. Most important among these is that the Fourier transform of MD dis- 
crete signals is rectangularly periodic. 


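Definition: The Fourier transform of a 2D discrete signal $s(n_1, n_2)$ is a
rectangularly periodic function of "normalized" continuous frequency
variables $\omega_1$ and $\omega_2$, where

$$S(e^{j\omega_1}, e^{j\omega_2}) = \sum_{n_1=-\infty}^{\infty} \sum_{n_2=-\infty}^{\infty} s(n_1, n_2)\, e^{-j(\omega_1 n_1 + \omega_2 n_2)} \tag{1.9a}$$

is periodic with period $2\pi \times 2\pi$, and the inverse 2D transform is given by

$$s(n_1, n_2) = \frac{1}{4\pi^2} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} S(e^{j\omega_1}, e^{j\omega_2})\, e^{j(\omega_1 n_1 + \omega_2 n_2)}\, d\omega_1\, d\omega_2 \tag{1.9b}$$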


The same concepts of uniform convergence, mean-square convergence, and
generalized convergence that have been discussed in Section 1.2.1 also apply here, but this
time in the context of convergence of infinite series rather than improper integrals.

Properties of the 2D Fourier Transform of Discrete Signals

1. The Fourier transform $S(e^{j\omega_1}, e^{j\omega_2})$ is periodic with the period $2\pi$ in both $\omega_1$
and $\omega_2$:

$$S(e^{j\omega_1}, e^{j\omega_2}) = S(e^{j(\omega_1 + 2\pi)}, e^{j\omega_2}) = S(e^{j\omega_1}, e^{j(\omega_2 + 2\pi)}) \tag{1.10}$$

Therefore, $S(e^{j\omega_1}, e^{j\omega_2})$ is usually specified in the unit cell $[-\pi, \pi) \times [-\pi, \pi)$.

2. The Fourier transform is complex:

$$S(e^{j\omega_1}, e^{j\omega_2}) = |S(e^{j\omega_1}, e^{j\omega_2})|\, e^{j\theta_s(\omega_1, \omega_2)} = S_R(e^{j\omega_1}, e^{j\omega_2}) + j S_I(e^{j\omega_1}, e^{j\omega_2})$$

where $S_R(e^{j\omega_1}, e^{j\omega_2})$ and $S_I(e^{j\omega_1}, e^{j\omega_2})$ denote the real and imaginary parts,
respectively, $|S(e^{j\omega_1}, e^{j\omega_2})|$ denotes the Fourier magnitude, and $\theta_s(\omega_1, \omega_2)$
denotes the Fourier phase. Note that the imaginary part $S_I(e^{j\omega_1}, e^{j\omega_2})$ is real.

3. Given a 2D complex signal $s(n_1, n_2)$, we can define its conjugate symmetric part
$s_e(n_1, n_2)$ and conjugate anti-symmetric part $s_o(n_1, n_2)$ as

$$s_e(n_1, n_2) = \tfrac{1}{2} \{ s(n_1, n_2) + s^*(-n_1, -n_2) \} \tag{1.11a}$$
$$s_o(n_1, n_2) = \tfrac{1}{2} \{ s(n_1, n_2) - s^*(-n_1, -n_2) \} \tag{1.11b}$$

Then, we have the following Fourier transform pairs:

$$s_e(n_1, n_2) \leftrightarrow S_R(e^{j\omega_1}, e^{j\omega_2})$$
$$s_o(n_1, n_2) \leftrightarrow j S_I(e^{j\omega_1}, e^{j\omega_2})$$

4. If $s(n_1, n_2)$ is real, then $S(e^{j\omega_1}, e^{j\omega_2})$ is Hermitian symmetric, i.e.,

$$S(e^{j\omega_1}, e^{j\omega_2}) = S^*(e^{-j\omega_1}, e^{-j\omega_2}) \tag{1.12}$$

which implies that the magnitude is an even function and the phase is an odd
function, i.e., $|S(e^{j\omega_1}, e^{j\omega_2})| = |S(e^{-j\omega_1}, e^{-j\omega_2})|$ and $\theta_s(\omega_1, \omega_2) = -\theta_s(-\omega_1, -\omega_2)$.

It also implies that the real part is even and the imaginary part is odd, i.e.,
$S_R(e^{j\omega_1}, e^{j\omega_2}) = S_R(e^{-j\omega_1}, e^{-j\omega_2})$ and $S_I(e^{j\omega_1}, e^{j\omega_2}) = -S_I(e^{-j\omega_1}, e^{-j\omega_2})$.
Since we have two-fold symmetry, it is sufficient to specify the value of
$S(e^{j\omega_1}, e^{j\omega_2})$ over the NSHP support depicted in Figure 1.1(c).

5. Spatial-domain symmetry and zero Fourier-phase: A signal $s(n_1, n_2)$ has zero
Fourier-phase if $S(e^{j\omega_1}, e^{j\omega_2})$ is real and positive. If $S(e^{j\omega_1}, e^{j\omega_2})$ is real but
negative for some $(\omega_1, \omega_2)$, then it has a phase of 180 degrees (or $\pi$ radians) for
those $(\omega_1, \omega_2)$ where it is negative. In either case, $S(e^{j\omega_1}, e^{j\omega_2}) = S_R(e^{j\omega_1}, e^{j\omega_2})$
implies that $s(n_1, n_2) = s^*(-n_1, -n_2)$, which is known as two-fold spatial
symmetry. A stronger symmetry is four-fold symmetry, which (for real-valued
signals) can be stated as $s(n_1, n_2) = s(-n_1, n_2) = s(n_1, -n_2)$. In the Fourier domain,
four-fold symmetry implies $S(e^{j\omega_1}, e^{j\omega_2}) = S(e^{-j\omega_1}, e^{j\omega_2}) = S(e^{j\omega_1}, e^{-j\omega_2})$.


1.2.3 Discrete Fourier Transform (DFT) 


The Fourier transform $S(e^{j\omega_1}, e^{j\omega_2})$ of a 2D discrete signal is a function of two
continuous variables $-\pi \le \omega_1 < \pi$ and $-\pi \le \omega_2 < \pi$. For a digital representation, the
frequency variables must be discretized or sampled. It turns out that this is only
possible for finite-extent discrete signals. Since finite-extent signals and periodic signals
are isomorphic to each other, the DFT of $s(n_1, n_2)$ with $N_1 \times N_2$ samples is equal
to the discrete Fourier series coefficients of the periodically extended signal $\tilde{s}(n_1, n_2)$
with period at least $N_1 \times N_2$. The Fourier series coefficients themselves are periodic
with period $N_1 \times N_2$. The main period of the Fourier series coefficients is defined as
the DFT of the finite-extent signal, given by

$$S(k_1, k_2) = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} s(n_1, n_2)\, e^{-j\left( \frac{2\pi}{N_1} n_1 k_1 + \frac{2\pi}{N_2} n_2 k_2 \right)}, \quad 0 \le k_1 \le N_1 - 1,\ 0 \le k_2 \le N_2 - 1 \tag{1.13}$$


Alternatively, it is possible to obtain the DFT $S(k_1, k_2)$ by sampling the Fourier
transform $S(e^{j\omega_1}, e^{j\omega_2})$ with $N_1 \times N_2$ samples over the unit cell $[0, 2\pi) \times [0, 2\pi)$.
Given that $s(n_1, n_2)$ is finite extent with at most $N_1 \times N_2$ samples, there will be no
spatial-domain aliasing due to sampling of the Fourier transform $S(e^{j\omega_1}, e^{j\omega_2})$. On
the other hand, if we sample the Fourier transform of an infinite-extent sequence
$s(n_1, n_2)$, or if $s(n_1, n_2)$ has more than $N_1 \times N_2$ samples, then $\hat{s}(n_1, n_2)$, obtained by
computing the $N_1 \times N_2$ point inverse DFT of these samples $S(k_1, k_2)$, will exhibit
space-domain aliasing, where $\hat{s}(n_1, n_2)$ is related to $s(n_1, n_2)$ by

$$\hat{s}(n_1, n_2) = \sum_{k_1=-\infty}^{\infty} \sum_{k_2=-\infty}^{\infty} s(n_1 - k_1 N_1,\ n_2 - k_2 N_2)$$

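Space-domain aliasing can be reproduced with a small experiment: sample the Fourier transform of a 3 x 4 signal on a 2 x 2 frequency grid, invert with a 2 x 2 inverse DFT, and compare against the sum of shifted copies. The signal, grid size, and function names in this sketch are assumptions chosen to keep the example short.

```python
import cmath
import math

def dsft(s, w1, w2):
    """Fourier transform of a finite-extent signal s[n1][n2] at frequencies (w1, w2)."""
    return sum(s[n1][n2] * cmath.exp(-1j * (w1 * n1 + w2 * n2))
               for n1 in range(len(s)) for n2 in range(len(s[0])))

def idft2(S):
    """Direct inverse 2D DFT of S[k1][k2]."""
    N1, N2 = len(S), len(S[0])
    return [[sum(S[k1][k2] * cmath.exp(2j * math.pi * (k1 * n1 / N1 + k2 * n2 / N2))
                 for k1 in range(N1) for k2 in range(N2)) / (N1 * N2)
             for n2 in range(N2)] for n1 in range(N1)]

s = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]       # hypothetical signal with 3 x 4 samples
N1, N2 = 2, 2               # deliberately too few frequency samples

samples = [[dsft(s, 2 * math.pi * k1 / N1, 2 * math.pi * k2 / N2)
            for k2 in range(N2)] for k1 in range(N1)]
s_hat = idft2(samples)

# Sum of shifted copies: all samples whose indices agree with (n1, n2) modulo (N1, N2).
folded = [[sum(s[m1][m2] for m1 in range(len(s)) for m2 in range(len(s[0]))
               if m1 % N1 == n1 and m2 % N2 == n2)
           for n2 in range(N2)] for n1 in range(N1)]
alias_error = max(abs(s_hat[n1][n2] - folded[n1][n2])
                  for n1 in range(N1) for n2 in range(N2))
```

Because the signal has more samples than the frequency grid, the recovered values are sums of overlapping copies rather than the original samples.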
Convergence 


There is no issue of convergence with the DFT, since Eqn. (1.13) involves finite 
summations only. 


Normalized Frequency Variables 


The variables $(k_1, k_2)$ are unitless indices, while the unit of the variables $(\omega_1, \omega_2)$ is
radians. They are related to the physical frequency variables $(u_1, u_2)$, whose unit is
cycles/distance, by

$$u_i = \frac{\omega_i}{\Delta x_i} = \frac{2\pi k_i}{N_i \Delta x_i}, \quad i = 1, 2 \tag{1.14}$$

where $\Delta x_1$ and $\Delta x_2$ are sampling distances in the horizontal and vertical dimensions
in the spatial coordinates.


Computation of the 2D DFT and Inverse 2D DFT 


Since 2D complex exponentials are separable, the 2D DFT can be computed as a cascade
of two 1D DFTs, first on the rows of $s(n_1,n_2)$ and then on the columns of $S(n_1,k_2)$, as

$$S(k_1,k_2) = \sum_{n_1=0}^{N_1-1} \left[ \sum_{n_2=0}^{N_2-1} s(n_1,n_2)\, e^{-j\frac{2\pi}{N_2}n_2k_2} \right] e^{-j\frac{2\pi}{N_1}n_1k_1}$$

where

$$S(n_1,k_2) = \sum_{n_2=0}^{N_2-1} s(n_1,n_2)\, e^{-j\frac{2\pi}{N_2}n_2k_2}$$

is the 1D DFT over the row $n_1$ of the image. Note that 1D DFTs are computed by
using the Fast Fourier Transform (FFT) algorithm.
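The row-column decomposition can be sketched with NumPy's 1D FFT. The array layout `s[n1, n2]` (first index selects the row $n_1$) and the random test image are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal((4, 6))       # s[n1, n2] with N1 = 4, N2 = 6

# Step 1: 1D DFT over n2 for every row n1, giving S(n1, k2)
S_rows = np.fft.fft(s, axis=1)
# Step 2: 1D DFT over n1 for every column k2, giving S(k1, k2)
S = np.fft.fft(S_rows, axis=0)

# The cascade agrees with a direct 2D FFT
print(np.allclose(S, np.fft.fft2(s)))
```
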

Inverse 2D DFT can be computed using the forward FFT algorithm, as illustrated
in Figure 1.4, by first conjugating $S(k_1,k_2)$, then computing the 2D forward DFT,
and finally again taking the complex conjugate of the result, since we have

$$s(n_1,n_2) = \frac{1}{N_1N_2} \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} S(k_1,k_2)\, e^{j\frac{2\pi}{N_1}n_1k_1}\, e^{j\frac{2\pi}{N_2}n_2k_2}
= \frac{1}{N_1N_2} \left[ \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} S^*(k_1,k_2)\, e^{-j\frac{2\pi}{N_1}n_1k_1}\, e^{-j\frac{2\pi}{N_2}n_2k_2} \right]^*$$

Figure 1.4 Computation of 2D inverse DFT using FFT algorithms.
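A minimal sketch of the conjugate, forward DFT, conjugate procedure, assuming NumPy's `fft2` as the forward transform; the function name `idft2_via_forward` is hypothetical.

```python
import numpy as np

def idft2_via_forward(S):
    """Inverse 2D DFT using only the forward FFT: conj -> fft2 -> conj -> scale."""
    N1, N2 = S.shape
    return np.conj(np.fft.fft2(np.conj(S))) / (N1 * N2)

rng = np.random.default_rng(0)
s = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))
S = np.fft.fft2(s)
print(np.allclose(idft2_via_forward(S), s))
```
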


Properties of the DFT 
1. The 2D DFT is rectangularly periodic with period $N_1 \times N_2$, i.e.,

$$S(k_1,k_2) = S(k_1+N_1,k_2) = S(k_1,k_2+N_2) \quad \text{for all } (k_1,k_2)$$

The 2D DFT $S(k_1,k_2)$ is complex valued and can be expressed as

$$S(k_1,k_2) = |S(k_1,k_2)|\, e^{j\theta(k_1,k_2)} = S_R(k_1,k_2) + jS_I(k_1,k_2)$$

where $|S(k_1,k_2)|$ and $\theta(k_1,k_2)$ denote the magnitude and phase of the DFT,
and $S_R(k_1,k_2)$ and $S_I(k_1,k_2)$ are the real and imaginary parts, respectively.

2. If we define the conjugate symmetric part $s_e(n_1,n_2)$ and conjugate anti-symmetric
part $s_o(n_1,n_2)$ of a complex signal $s(n_1,n_2)$ as in (1.11), then we have
the following Fourier transform pairs:

$$s_e(n_1,n_2) \leftrightarrow S_R(k_1,k_2)$$
$$s_o(n_1,n_2) \leftrightarrow jS_I(k_1,k_2)$$


3. For real-valued signals $s(n_1,n_2)$, the DFT is Hermitian symmetric, which
means

$$S(k_1,k_2) = S^*(-k_1,-k_2)$$

or equivalently $|S(k_1,k_2)| = |S(-k_1,-k_2)|$ and $\theta(k_1,k_2) = -\theta(-k_1,-k_2)$.
Because of this symmetry, the total number of non-redundant DFT magnitude
and phase samples is equal to the number of image samples. An example for
$N_1 = N_2 = 6$ is shown in Figure 1.5. Note that $\theta(0,0) = -\theta(0,0) = 0$;
$\theta(3,0) = -\theta(-3,0) = -\theta(-3+6,0) = 0$; $\theta(0,3) = -\theta(0,-3+6) =
-\theta(0,3) = 0$; and $\theta(3,3) = -\theta(-3,-3) = 0$ because of the periodicity
of the DFT. There are only 20 distinct DFT magnitude coefficients and 16
distinct DFT phase coefficients, for a total of $6 \times 6 = 36$ distinct DFT coefficients
for a 36-pixel image.

(a)
|S(0,0)|  |S(1,0)|  |S(2,0)|  |S(3,0)|  |S(2,0)|  |S(1,0)|
|S(0,1)|  |S(1,1)|  |S(2,1)|  |S(3,1)|  |S(4,1)|  |S(5,1)|
|S(0,2)|  |S(1,2)|  |S(2,2)|  |S(3,2)|  |S(4,2)|  |S(5,2)|
|S(0,3)|  |S(1,3)|  |S(2,3)|  |S(3,3)|  |S(2,3)|  |S(1,3)|
|S(0,2)|  |S(5,2)|  |S(4,2)|  |S(3,2)|  |S(2,2)|  |S(1,2)|
|S(0,1)|  |S(5,1)|  |S(4,1)|  |S(3,1)|  |S(2,1)|  |S(1,1)|

(b)
0         θ(1,0)    θ(2,0)    0         -θ(2,0)   -θ(1,0)
θ(0,1)    θ(1,1)    θ(2,1)    θ(3,1)    θ(4,1)    θ(5,1)
θ(0,2)    θ(1,2)    θ(2,2)    θ(3,2)    θ(4,2)    θ(5,2)
0         θ(1,3)    θ(2,3)    0         -θ(2,3)   -θ(1,3)
-θ(0,2)   -θ(5,2)   -θ(4,2)   -θ(3,2)   -θ(2,2)   -θ(1,2)
-θ(0,1)   -θ(5,1)   -θ(4,1)   -θ(3,1)   -θ(2,1)   -θ(1,1)

Figure 1.5 Hermitian symmetry of $S(k_1,k_2)$ for real $s(n_1,n_2)$. Gray-shaded
coefficients are distinct, and others are determined by the symmetry.
(a) The magnitude is even; (b) the phase is odd.
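The counts of 20 magnitudes and 16 phases for the $6 \times 6$ case can be reproduced by pairing every index with its mirror image. This is a sketch under the assumption that indices are taken modulo $N_1$ and $N_2$; the variable names are illustrative.

```python
import numpy as np

N1 = N2 = 6
rng = np.random.default_rng(1)
S = np.fft.fft2(rng.standard_normal((N1, N2)))   # DFT of a real image, S[k1, k2]

# Hermitian symmetry: S(k1, k2) = conj(S(-k1 mod N1, -k2 mod N2))
k1, k2 = np.meshgrid(np.arange(N1), np.arange(N2), indexing="ij")
hermitian = np.allclose(S, np.conj(S[(-k1) % N1, (-k2) % N2]))

# Group each index with its mirror image; a group of size 1 is its own mirror
pairs = {frozenset({(a, b), ((-a) % N1, (-b) % N2)})
         for a in range(N1) for b in range(N2)}
self_mirrored = sum(len(p) == 1 for p in pairs)  # DFT value is real at these points
n_magnitudes = len(pairs)                        # one free magnitude per group
n_phases = len(pairs) - self_mirrored            # no free phase at self-mirrored points
print(hermitian, n_magnitudes, n_phases)
```

The four self-mirrored indices are (0,0), (3,0), (0,3), and (3,3), which are the entries whose phase is fixed in Figure 1.5(b).
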

4. Circular Shift: Since $s(n_1,n_2)$ is an $N_1 \times N_2$ finite-extent signal, any shift of
coordinates must be such that the shifted signal remains within the original
$N_1 \times N_2$ support. This is called a circular shift and can be defined by using the
modulo operation $(\cdot)_N$. We have the following Fourier transform relation:

$$s\big((n_1-m_1)_{N_1}, (n_2-m_2)_{N_2}\big) \leftrightarrow S(k_1,k_2)\, e^{-j\left(\frac{2\pi}{N_1}k_1m_1 + \frac{2\pi}{N_2}k_2m_2\right)}$$

5. Shift of Origin: It is usually desirable to shift the origin to the center of the
image to obtain more visually pleasing 2D plots. This can be achieved by forming
the modulated image

$$s(n_1,n_2)\,(-1)^{n_1+n_2} \qquad (1.15)$$

prior to computing the DFT. Using the frequency shifting property of the Fourier
transform,

$$e^{j\frac{2\pi}{N_1}\frac{N_1}{2}n_1}\, e^{j\frac{2\pi}{N_2}\frac{N_2}{2}n_2}\, s(n_1,n_2) \leftrightarrow S\left(k_1 - \frac{N_1}{2},\ k_2 - \frac{N_2}{2}\right)$$

since

$$e^{j\frac{2\pi}{N_1}\frac{N_1}{2}n_1}\, e^{j\frac{2\pi}{N_2}\frac{N_2}{2}n_2} = e^{j\pi(n_1+n_2)} = (-1)^{n_1+n_2}$$


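6. Parseval’s Theorem: DFT preserves energy; i.e., the energy in the spatio-temporal
domain is unchanged in the Fourier domain, which can be expressed as

$$\sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} |s(n_1,n_2)|^2 = \frac{1}{N_1N_2} \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} |S(k_1,k_2)|^2 \qquad (1.16)$$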
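Properties 4 to 6 can be verified numerically. The sketch below assumes even $N_1$ and $N_2$ (needed for the shift of origin by $N/2$) and uses `np.roll` and `np.fft.fftshift` as stand-ins for the circular shift and the centered spectrum; the shift amounts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
N1, N2 = 4, 6
s = rng.standard_normal((N1, N2))
S = np.fft.fft2(s)
# Index grids; since s and S have the same size they serve as (n1, n2) and (k1, k2)
g1, g2 = np.meshgrid(np.arange(N1), np.arange(N2), indexing="ij")

# Property 4: circular shift by (m1, m2) multiplies the DFT by a linear phase
m1, m2 = 1, 2
shifted = np.roll(s, shift=(m1, m2), axis=(0, 1))   # s((n1-m1) mod N1, (n2-m2) mod N2)
phase = np.exp(-2j * np.pi * (g1 * m1 / N1 + g2 * m2 / N2))
shift_ok = np.allclose(np.fft.fft2(shifted), S * phase)

# Property 5: multiplying by (-1)^(n1+n2) moves S(0, 0) to the center of the array
centered = np.fft.fft2(s * (-1.0) ** (g1 + g2))
origin_ok = np.allclose(centered, np.fft.fftshift(S))

# Property 6: Parseval's theorem with the 1/(N1 N2) factor on the DFT side
parseval_ok = np.isclose(np.sum(np.abs(s) ** 2), np.sum(np.abs(S) ** 2) / (N1 * N2))
print(shift_ok, origin_ok, parseval_ok)
```
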


1.2.4 Discrete Cosine Transform (DCT) 


The DCT (or its integer approximations) is often the transform of choice in
state-of-the-art image- and video-compression standards for the following reasons:
i) unlike the DFT, it is real valued for real-valued images (note that the DCT is not
the same as the real part of the DFT); ii) it can be computed by FFT algorithms
(developed to compute the DFT) via a symmetric extension of the signal; and iii) the
high-frequency coefficients of the DCT contain less energy than those
of the DFT, since the intensity discontinuity at the image boundaries due to the implicit
periodic extension in the DFT is alleviated by the symmetric extension in the DCT.

Similar to 2D-DFT, 2D-DCT basis functions are separable; hence, the computa- 
tion of 2D-DCT and inverse DCT can be performed as a cascade of two 1D-DCTs 
just as in the 2D-DFT. That is, we first compute the 1D-DCT of columns of the 
image, followed by the 1D-DCT of rows of the intermediate result. 

There are eight types of 1D-DCT depending on the method used for the sym- 
metric extension. DCT types I-IV treat right and left boundaries consistently 
regarding the point of symmetry; i.e., they are even/odd around either a data point 
or halfway between two data points for both boundaries. On the other hand, in 
DCT types V-VIII, the symmetry type alternates between right and left boundaries; 
i.e., it is even/odd around a data point for one boundary and halfway between two 
data points for the other boundary. 




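The most common form of DCT is the type II DCT where both boundaries are
symmetric with respect to halfway between two samples; i.e., the last sample repeats.
Type II 2D-DCT is given by

$$C(k_1,k_2) = \frac{2}{\sqrt{N_1N_2}}\, w(k_1)\, w(k_2) \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} s(n_1,n_2) \cos\left[\frac{\pi k_1}{2N_1}(2n_1+1)\right] \cos\left[\frac{\pi k_2}{2N_2}(2n_2+1)\right] \qquad (1.17)$$

and the inverse 2D DCT is given by

$$s(n_1,n_2) = \frac{2}{\sqrt{N_1N_2}} \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} w(k_1)\, w(k_2)\, C(k_1,k_2) \cos\left[\frac{\pi k_1}{2N_1}(2n_1+1)\right] \cos\left[\frac{\pi k_2}{2N_2}(2n_2+1)\right]$$

where

$$w(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & k \ne 0 \end{cases}$$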
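Because the basis functions are separable, the type II 2D-DCT can be written as two matrix products. The sketch below assumes the orthonormal scaling $2/\sqrt{N_1N_2}$ with $w(0) = 1/\sqrt{2}$; the function names are hypothetical.

```python
import numpy as np

def dct_matrix(N):
    """N x N matrix D with D[k, n] = sqrt(2/N) * w(k) * cos(pi*k*(2n+1) / (2N))."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    w = np.where(k == 0, 1.0 / np.sqrt(2.0), 1.0)
    return np.sqrt(2.0 / N) * w * np.cos(np.pi * k * (2 * n + 1) / (2 * N))

def dct2d(s):
    """Type II 2D-DCT: 1D-DCT along n1 and along n2 (overall factor 2/sqrt(N1*N2))."""
    D1, D2 = dct_matrix(s.shape[0]), dct_matrix(s.shape[1])
    return D1 @ s @ D2.T

def idct2d(C):
    """Inverse 2D-DCT; the DCT matrices are orthogonal, so the inverse is the transpose."""
    D1, D2 = dct_matrix(C.shape[0]), dct_matrix(C.shape[1])
    return D1.T @ C @ D2

rng = np.random.default_rng(3)
s = rng.standard_normal((4, 8))
C = dct2d(s)
print(np.allclose(idct2d(C), s))
```

With this scaling a constant image of value $c$ has a single nonzero coefficient $C(0,0) = c\sqrt{N_1N_2}$.
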


Relationship to the DFT 


Each N-point 1D-DCT over the columns (rows) can be computed by a 2N-point
1D-FFT after symmetric extension over the columns (rows). Symmetric extension
for type II DCT is given by

$$g(n) = s(n) + s(2N-1-n), \quad 0 \le n \le 2N-1$$

where $s(n)$ is an N-point signal; i.e., $s(n) = 0$, $N \le n \le 2N-1$. Note that $g(n)$ is
symmetric with respect to sample $N - \tfrac{1}{2}$, which is between the last original
sample $N-1$ and the first extended sample $N$, and $g(N) = s(N-1)$ and $g(2N-1) = s(0)$.
The algorithm to compute the N-point DCT using the 2N-point FFT is as
follows:

1. Form the 2N-point symmetrically extended signal $g(n)$.
2. Compute $G(k)$, $k = 0, \ldots, 2N-1$, the 2N-point DFT of $g(n)$.
3. The N-point DCT is $C(k) = W_{2N}^{k/2}\, G(k)$, $k = 0, \ldots, N-1$, where $W_{2N} = e^{-j\frac{2\pi}{2N}}$.

In the case of 2D-DCT, the 2D-DCT of an $N_1 \times N_2$ image is related to the first
quadrant ($N_1 \times N_2$ points) of a $2N_1 \times 2N_2$ point 2D-DFT of the symmetrically
extended signal. Symmetric extension must be performed both along the columns
(before the column DCT) and along the rows (before the row DCT):

$$g(n_1,n_2) = s(n_1,n_2) + s(n_1,\ 2N_2-1-n_2) + s(2N_1-1-n_1,\ n_2) + s(2N_1-1-n_1,\ 2N_2-1-n_2)$$

Hence, the resulting symmetrically extended 2D signal exhibits four-fold symmetry.
Finally, the $N_1 \times N_2$ point DCT coefficients $C(k_1,k_2)$ can be expressed in terms
of $2N_1 \times 2N_2$ point 2D-DFT values $G(k_1,k_2)$ by

$$C(k_1,k_2) = e^{-j\frac{\pi k_1}{2N_1}}\, e^{-j\frac{\pi k_2}{2N_2}}\, G(k_1,k_2), \quad \text{for } (k_1,k_2) \in [0, N_1-1] \times [0, N_2-1]$$

Note that the phase factor offsets the phase of the DFT that originates from the half-
point symmetry about the point $(N_1 - \tfrac{1}{2},\ N_2 - \tfrac{1}{2})$.
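The three-step 1D algorithm can be sketched as follows. Note that $W_{2N}^{k/2}G(k)$ yields the unnormalized DCT $2\sum_n s(n)\cos[\pi k(2n+1)/(2N)]$, i.e., without the $w(k)$ and square-root scale factors; the function names are hypothetical, and the direct cosine sum is included only as a reference.

```python
import numpy as np

def dct_via_fft(s):
    """Unnormalized N-point type II DCT computed with one 2N-point FFT."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    g = np.concatenate([s, s[::-1]])             # step 1: g(n) = s(n) + s(2N-1-n)
    G = np.fft.fft(g)                            # step 2: 2N-point DFT
    k = np.arange(N)
    W_half = np.exp(-1j * np.pi * k / (2 * N))   # W_{2N}^{k/2} with W_{2N} = exp(-j*2*pi/(2N))
    return (W_half * G[:N]).real                 # step 3: imaginary part vanishes up to round-off

def dct_direct(s):
    """Reference: C(k) = 2 * sum_n s(n) cos(pi*k*(2n+1) / (2N))."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    return 2.0 * (np.cos(np.pi * k * (2 * n + 1) / (2 * N)) @ s)

x = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
print(np.allclose(dct_via_fft(x), dct_direct(x)))
```

The 2D case follows by applying the same routine along the columns and then along the rows.
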


1.3 Multi-Dimensional Systems 


Classical signal-processing texts study mainly linear shift-invariant (LSI) systems (fil- 
ters), which can be classified as finite-impulse response (FIR) and infinite-impulse 
response (IIR) filters [Lim 90]. However, most image- and video-processing applica- 
tions require directional and/or adaptive filters, which are not LSI systems. Edge-adap- 
tive filters in image-processing and motion-compensated filters for video processing 
are examples of such directional filtering. Nevertheless, this section focuses on LSI fil- 
ters since LSI-FIR filters are used in some important applications, and it is important 
to gain an understanding of the limitations of LSI filters. Adaptive filters are covered 
in subsequent chapters in the context of specific applications. We do not review the 
definitions of linearity and shift-invariance here as they can be found elsewhere. 


1.3.1 Impulse Response and 2D Convolution 


Just as in 1D systems, MD-LSI filters can be completely specified by their impulse 
response. In 2D, impulse response is the response of a 2D system L to a 2D unit 
impulse input, denoted by 


$$h(n_1,n_2) = L[\delta(n_1,n_2)] \qquad (1.18)$$

The output of a 2D linear shift-invariant system to an arbitrary input $s(n_1,n_2)$
can be expressed as

$$g(n_1,n_2) = L[s(n_1,n_2)] \qquad (1.19)$$

Expressing the input as a sum of weighted and shifted 2D unit impulses, as
stated by Eqn. (1.5b), and substituting (1.5b) into (1.19),

$$g(n_1,n_2) = L\left[ \sum_{k_1} \sum_{k_2} s(k_1,k_2)\, \delta(n_1-k_1, n_2-k_2) \right]$$

Changing the order of linear operator L and double summation, using linearity,

$$g(n_1,n_2) = \sum_{k_1} \sum_{k_2} s(k_1,k_2)\, L[\delta(n_1-k_1, n_2-k_2)]$$

Now using the definition of the impulse response together with shift-invariance
of L, we obtain

$$g(n_1,n_2) = \sum_{k_1} \sum_{k_2} s(k_1,k_2)\, h(n_1-k_1, n_2-k_2) \qquad (1.20a)$$

which, by a change of variables, is equivalent to

$$g(n_1,n_2) = \sum_{k_1} \sum_{k_2} h(k_1,k_2)\, s(n_1-k_1, n_2-k_2) \qquad (1.20b)$$

Both (1.20a) and (1.20b) are known as 2D convolution summation. In
theory, both $s(n_1,n_2)$ and $h(n_1,n_2)$ may have infinite support, which results in
infinite summations. Then, it can be shown that the convolution summation
converges if

$$\sum_{n_1=-\infty}^{\infty} \sum_{n_2=-\infty}^{\infty} |h(n_1,n_2)| < \infty \qquad (1.21)$$


Stable Filters 


A 2D-LSI system is called stable if and only if its impulse response satisfies Eqn.
(1.21). The impulse response of all FIR filters satisfies (1.21); hence, all FIR filters
are stable. There are tests to check stability of IIR filters [Wds 06]. Only stable IIR
filters can be implemented.

Figure 1.6 Graphical illustration of 2D convolution.


Computation of the Convolution Summation 


Computation of 2D convolution summation (1.20) is illustrated in Figure 1.6. We
first flip $h(k_1,k_2)$ both along the $k_1$ and $k_2$ axes to obtain $h(-k_1,-k_2)$, which is then
shifted over $s(k_1,k_2)$ for specific values of $n_1$ and $n_2$. We repeat this procedure for all
possible values of $n_1$ and $n_2$.

Numerical computation of (1.20) is possible if both the input image and the
impulse response of the filter are finite extent. If we assume the input image is
$N_1 \times N_2$, and the support of the filter $h(n_1,n_2)$ is $M_1 \times M_2$, then the output will be
$(N_1+M_1-1) \times (N_2+M_2-1)$ points. Then, the implementation of 2D convolution
summation (1.20) requires $M_1M_2$ multiplies and $M_1M_2$ adds per pixel, which may
be time consuming if $M_1$ and $M_2$ are large.
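A direct implementation of (1.20b) for finite-extent arrays might look like the sketch below. It assumes both arrays have their origin at index (0, 0); the function name `conv2d_full` is hypothetical.

```python
import numpy as np

def conv2d_full(s, h):
    """Full 2D linear convolution g = sum_k h(k1, k2) s(n1-k1, n2-k2)."""
    N1, N2 = s.shape
    M1, M2 = h.shape
    g = np.zeros((N1 + M1 - 1, N2 + M2 - 1))
    for k1 in range(M1):
        for k2 in range(M2):
            # The term h(k1, k2) s(n1-k1, n2-k2) is a copy of s shifted by (k1, k2)
            g[k1:k1 + N1, k2:k2 + N2] += h[k1, k2] * s
    return g

s = np.array([[1.0, 2.0],
              [3.0, 4.0]])
h = np.ones((2, 2))
g = conv2d_full(s, h)
print(g.shape)   # (N1 + M1 - 1, N2 + M2 - 1) = (3, 3)
print(g)
```

The double loop runs over the $M_1M_2$ filter taps, which mirrors the per-pixel operation count given above.
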


Separable Filters 


Convolution using separable kernels results in significant computational savings. 
A filter is called separable if its impulse response is separable. 


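$$h(n_1,n_2) = h_1(n_1)\, h_2(n_2)$$

Convolution with a 2D separable impulse response can be written as a cascade
of two 1D convolutions:

$$\begin{aligned}
g(n_1,n_2) &= s(n_1,n_2) ** h(n_1,n_2) \\
&= \sum_{k_1} \sum_{k_2} s(k_1,k_2)\, h(n_1-k_1, n_2-k_2) \\
&= \sum_{k_1} \sum_{k_2} s(k_1,k_2)\, h_1(n_1-k_1)\, h_2(n_2-k_2) \\
&= \sum_{k_1} h_1(n_1-k_1) \sum_{k_2} s(k_1,k_2)\, h_2(n_2-k_2) \\
&= \sum_{k_1} h_1(n_1-k_1)\, \big[ h_2(n_2) * s(k_1,n_2) \big] \\
&= h_1(n_1) * \big[ h_2(n_2) * s(n_1,n_2) \big]
\end{aligned}$$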


Therefore, for an $M \times M$ separable filter, implementation of the separable
convolution requires M multiplies and M adds per pixel for each 1D convolution; hence,
a total of 2M multiplies and 2M additions. For a typical $15 \times 15$ filter, this means
30 multiplies and adds instead of 225 multiplies and adds per pixel.
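The cascade of two 1D convolutions can be sketched with `np.convolve` applied along each axis. The reference loop and the test filter are illustrative assumptions; the first array index is taken to be $n_1$.

```python
import numpy as np

def conv_separable(s, h1, h2):
    """g = h1(n1) * [h2(n2) * s(n1, n2)] as two passes of 1D convolution."""
    inner = np.apply_along_axis(np.convolve, 1, s, h2)    # along n2, for each n1
    return np.apply_along_axis(np.convolve, 0, inner, h1)  # along n1, for each n2

def conv2d_reference(s, h):
    """Plain 2D convolution used only to check the separable result."""
    g = np.zeros((s.shape[0] + h.shape[0] - 1, s.shape[1] + h.shape[1] - 1))
    for k1 in range(h.shape[0]):
        for k2 in range(h.shape[1]):
            g[k1:k1 + s.shape[0], k2:k2 + s.shape[1]] += h[k1, k2] * s
    return g

rng = np.random.default_rng(4)
s = rng.standard_normal((8, 10))
h1 = np.array([1.0, 2.0, 1.0])
h2 = np.array([1.0, 0.0, -1.0])
h = np.outer(h1, h2)                       # h(n1, n2) = h1(n1) h2(n2)
print(np.allclose(conv_separable(s, h1, h2), conv2d_reference(s, h)))

M = 15
ops_separable, ops_direct = 2 * M, M * M   # multiplies per output pixel
print(ops_separable, ops_direct)
```
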


1.3.2 Frequency Response 


Stable LSI systems can also be characterized by their frequency response. The
frequency response of an LSI system is defined in terms of its output $g(n_1,n_2)$ to a
complex exponential input

$$s(n_1,n_2) = e^{j(\omega_1n_1 + \omega_2n_2)} \qquad (1.22)$$

Substituting (1.22) into (1.20), we obtain

$$g(n_1,n_2) = \sum_{k_1} \sum_{k_2} h(k_1,k_2)\, e^{j[\omega_1(n_1-k_1) + \omega_2(n_2-k_2)]}$$

Taking the terms that do not depend on $k_1$ and $k_2$ out of the summations,

$$g(n_1,n_2) = e^{j(\omega_1n_1 + \omega_2n_2)} \sum_{k_1} \sum_{k_2} h(k_1,k_2)\, e^{-j(\omega_1k_1 + \omega_2k_2)} = e^{j(\omega_1n_1 + \omega_2n_2)}\, H(e^{j\omega_1}, e^{j\omega_2})$$

where

$$H(e^{j\omega_1}, e^{j\omega_2}) = \sum_{k_1} \sum_{k_2} h(k_1,k_2)\, e^{-j(\omega_1k_1 + \omega_2k_2)} \qquad (1.23)$$

is called the frequency response of the system. We observe that the output of an LSI
system to a complex exponential input is the same input multiplied by a complex-
valued “frequency response,” which is a function of frequencies $\omega_1$ and $\omega_2$. Note
that the frequency response is the MD Fourier transform of the system impulse
response.

The frequency response of separable filters is also separable and can be written as 


$$H(e^{j\omega_1}, e^{j\omega_2}) = H_1(e^{j\omega_1})\, H_2(e^{j\omega_2}) \qquad (1.24)$$


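Magnitude and Phase of the Frequency Response

Since the frequency response is a complex-valued function, it can be expressed in
terms of its real part $H_R(e^{j\omega_1}, e^{j\omega_2})$ and imaginary part $H_I(e^{j\omega_1}, e^{j\omega_2})$, which are
both real-valued functions:

$$H(e^{j\omega_1}, e^{j\omega_2}) = H_R(e^{j\omega_1}, e^{j\omega_2}) + jH_I(e^{j\omega_1}, e^{j\omega_2}) \qquad (1.25a)$$

or in terms of a real, nonnegative magnitude and a real phase function $\theta(\omega_1,\omega_2)$

$$H(e^{j\omega_1}, e^{j\omega_2}) = |H(e^{j\omega_1}, e^{j\omega_2})|\, e^{j\theta(\omega_1,\omega_2)} \qquad (1.25b)$$

where

$$|H(e^{j\omega_1}, e^{j\omega_2})| = \sqrt{H_R^2(e^{j\omega_1}, e^{j\omega_2}) + H_I^2(e^{j\omega_1}, e^{j\omega_2})} \qquad (1.26a)$$

and

$$\theta(\omega_1,\omega_2) = \tan^{-1}\left[ \frac{H_I(e^{j\omega_1}, e^{j\omega_2})}{H_R(e^{j\omega_1}, e^{j\omega_2})} \right] \qquad (1.26b)$$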


The phase response of a filter plays a vital role in image processing. Phase distor- 
tions introduced by filtering are visible to the eye as artifacts. Therefore, it is highly 
desirable that filters used in image processing have zero or linear phase. This can be 
achieved by using FIR filters with symmetry properties as discussed in Section 1.3.3. 
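The magnitude and phase of a small FIR filter can be evaluated directly from (1.23). In this sketch the $3 \times 3$ filter, the frequency grid, and the helper `freq_response` (including its `origin` argument, the array index where $h(0,0)$ is stored) are illustrative assumptions.

```python
import numpy as np

def freq_response(h, origin, w1, w2):
    """H(e^{jw1}, e^{jw2}) = sum_k h(k1, k2) exp(-j(w1*k1 + w2*k2)) on a grid."""
    k1 = np.arange(h.shape[0]) - origin[0]
    k2 = np.arange(h.shape[1]) - origin[1]
    E1 = np.exp(-1j * np.outer(w1, k1))    # shape (len(w1), M1)
    E2 = np.exp(-1j * np.outer(w2, k2))    # shape (len(w2), M2)
    return E1 @ h @ E2.T                   # entry [a, b] is H at (w1[a], w2[b])

h = np.outer([1.0, 2.0, 1.0], [1.0, 2.0, 1.0]) / 16.0
w = np.linspace(-np.pi, np.pi, 9)

H_centered = freq_response(h, (1, 1), w, w)   # symmetric about the origin
H_causal = freq_response(h, (0, 0), w, w)     # same taps, origin at the corner

magnitude = np.abs(H_centered)
phase = np.angle(H_causal)                    # linear phase caused by the shift
print(np.max(np.abs(H_centered.imag)))        # essentially zero: zero-phase filter
```

Centering the symmetric taps on the origin gives a real response, while moving the origin to the corner leaves the magnitude unchanged and adds the phase term $e^{-j(\omega_1+\omega_2)}$.
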


Convolution in the Fourier Domain 


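The output of an LSI filter to an arbitrary input can also be computed in the Fourier
domain using the convolution property of the Fourier transform. Taking the
Fourier transform of the convolution summation (1.20b), we have

$$G(e^{j\omega_1}, e^{j\omega_2}) = \sum_{n_1=-\infty}^{\infty} \sum_{n_2=-\infty}^{\infty} \left[ \sum_{k_1} \sum_{k_2} h(k_1,k_2)\, s(n_1-k_1, n_2-k_2) \right] e^{-j(\omega_1n_1 + \omega_2n_2)}$$

Rewriting the exponential function,

$$G(e^{j\omega_1}, e^{j\omega_2}) = \sum_{n_1} \sum_{n_2} \sum_{k_1} \sum_{k_2} h(k_1,k_2)\, s(n_1-k_1, n_2-k_2)\, e^{-j[\omega_1(n_1-k_1) + \omega_2(n_2-k_2)]}\, e^{-j(\omega_1k_1 + \omega_2k_2)}$$

and rearranging the terms,

$$G(e^{j\omega_1}, e^{j\omega_2}) = \sum_{k_1} \sum_{k_2} h(k_1,k_2)\, e^{-j(\omega_1k_1 + \omega_2k_2)} \sum_{n_1} \sum_{n_2} s(n_1-k_1, n_2-k_2)\, e^{-j[\omega_1(n_1-k_1) + \omega_2(n_2-k_2)]}$$

if we let $n_1' = n_1 - k_1$ and $n_2' = n_2 - k_2$, we obtain

$$G(e^{j\omega_1}, e^{j\omega_2}) = \left( \sum_{k_1} \sum_{k_2} h(k_1,k_2)\, e^{-j(\omega_1k_1 + \omega_2k_2)} \right) \left( \sum_{n_1'} \sum_{n_2'} s(n_1',n_2')\, e^{-j(\omega_1n_1' + \omega_2n_2')} \right)
= H(e^{j\omega_1}, e^{j\omega_2})\, S(e^{j\omega_1}, e^{j\omega_2}) \qquad (1.27)$$

Therefore, the Fourier transform of the output at the frequency $(\omega_1,\omega_2)$ depends on
the Fourier transform of the input and the frequency response of the system only at
the frequency $(\omega_1,\omega_2)$, and a stable LSI system can be completely specified by its
impulse response or by its frequency response. We can implement Eqn. (1.27) in the
computer using the DFT, which is only possible for FIR filters.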


1.3.3 FIR Filters and Symmetry 


Filters whose impulse response has a finite support are called finite-impulse response 
(FIR) filters. A highly desirable property of FIR filters is that they can be designed to 


have zero or linear phase. 
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Implementing (1.27) using the DFT yields circular convolution in the space 


domain, since 
$$H(k_1,k_2)\, S(k_1,k_2) \leftrightarrow h(n_1,n_2) \circledast s(n_1,n_2)$$

Circular convolution is defined between two periodic signals that share a common
period. The result $\tilde{g}(n_1,n_2)$ of circular convolution of two rectangularly
periodic signals is also periodic with the same period $(N_1,N_2)$, given by

$$\tilde{g}(n_1,n_2) = \sum_{i_1=0}^{N_1-1} \sum_{i_2=0}^{N_2-1} h(n_1-i_1, n_2-i_2)\, s(i_1,i_2), \quad (n_1,n_2) \in [0, N_1-1] \times [0, N_2-1]$$


The circular convolution produces the same result as that of linear convolution
if we set the size of the DFTs to at least $(N_1+M_1-1) \times (N_2+M_2-1)$. Hence,
both the FIR filter impulse response array and the image array must be padded by
zeros. We can summarize the procedure to implement linear convolution in the
DFT domain as:

1. Pad both the FIR filter impulse response $h(n_1,n_2)$ and the image $s(n_1,n_2)$ by
zeros to obtain $(N_1+M_1-1) \times (N_2+M_2-1)$ arrays.

2. Compute $(N_1+M_1-1) \times (N_2+M_2-1)$ point DFTs of both $h(n_1,n_2)$ and
$s(n_1,n_2)$.

3. Multiply the $(N_1+M_1-1) \times (N_2+M_2-1)$ arrays $H(k_1,k_2)$ and $S(k_1,k_2)$.

4. Take the $(N_1+M_1-1) \times (N_2+M_2-1)$ point IDFT of the product to find the
output.
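The four steps can be sketched with NumPy, relying on the `s=` argument of `np.fft.fft2` to perform the zero padding. The function name and the small test arrays are illustrative assumptions.

```python
import numpy as np

def conv2d_fft(s, h):
    """Linear 2D convolution computed in the DFT domain with zero padding."""
    N1, N2 = s.shape
    M1, M2 = h.shape
    size = (N1 + M1 - 1, N2 + M2 - 1)
    S = np.fft.fft2(s, s=size)           # steps 1-2: zero pad, then DFT
    H = np.fft.fft2(h, s=size)
    return np.fft.ifft2(S * H).real      # steps 3-4: multiply, then IDFT

s = np.array([[1.0, 2.0],
              [3.0, 4.0]])
h = np.ones((2, 2))

linear = conv2d_fft(s, h)
# Without padding, the same product yields circular convolution (wrap-around)
circular = np.fft.ifft2(np.fft.fft2(s) * np.fft.fft2(h, s=s.shape)).real
print(linear)
print(circular)
```

For this example the unpadded result is the constant 10 at every pixel, because each output sample wraps around and sums all four input samples.
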


Symmetric Filters 


Symmetric filters are desirable since they have zero-phase response or linear-phase
response. Furthermore, the number of multiplications in the implementation of
(1.20) can be reduced using symmetry. Practical image-processing filters are either
two-fold, four-fold, or circularly symmetric.

Two-fold rectangular symmetry is also called non-symmetric half-plane symmetry,
which is defined by

$$h(n_1,n_2) = h(-n_1,-n_2) \qquad (1.28a)$$


The distinct coefficients are depicted by dots in Figure 1.1(c). The remaining 
coefficients needed to complete the square support are determined by the symmetry. 
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A more strict form of symmetry is the four-fold rectangular symmetry given by 
$$h(n_1,n_2) = h(-n_1,n_2) = h(n_1,-n_2) \qquad (1.28b)$$

The support of the distinct coefficients in this case is a quarter-plane.

Circular symmetry is a natural form of symmetry where coefficients are only
a function of the distance $\sqrt{n_1^2+n_2^2}$ from the origin. A filter is said to have a
circularly symmetric impulse response if $h(n_1,n_2)$ is a function of $n_1^2+n_2^2$. A filter is
said to have a circularly symmetric frequency response if $H(e^{j\omega_1}, e^{j\omega_2})$ is a
function of $\omega_1^2+\omega_2^2$ for $\sqrt{\omega_1^2+\omega_2^2} < \pi$ and is constant outside this region within
$-\pi \le \omega_1, \omega_2 \le \pi$. Circular symmetry of $H(e^{j\omega_1}, e^{j\omega_2})$ implies circular symmetry
of $h(n_1,n_2)$, but not vice versa.


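Example. Determine the impulse response of the ideal low-pass filter
whose frequency response is circularly symmetric given by

$$H(e^{j\omega_1}, e^{j\omega_2}) = \begin{cases} 1 & \text{if } \sqrt{\omega_1^2+\omega_2^2} \le \omega_c \\ 0 & \text{if } \sqrt{\omega_1^2+\omega_2^2} > \omega_c \text{ and } 0 \le |\omega_1|, |\omega_2| < \pi \end{cases}$$

Taking the inverse 2D Fourier transform of $H(e^{j\omega_1}, e^{j\omega_2})$ yields

$$h(n_1,n_2) = \frac{\omega_c}{2\pi\sqrt{n_1^2+n_2^2}}\, J_1\left( \omega_c\sqrt{n_1^2+n_2^2} \right)$$

where $J_1(x)$ denotes the Bessel function of first kind and first order, which can be
expressed as a series

$$J_1(x) = \frac{x}{2} - \frac{x^3}{2^3\,1!\,2!} + \frac{x^5}{2^5\,2!\,3!} - \frac{x^7}{2^7\,3!\,4!} + \cdots$$

Referring to Section 1.2.2, two-fold symmetry of $h(n_1,n_2)$ is sufficient for
$H(e^{j\omega_1}, e^{j\omega_2})$ to be real; hence, zero phase. If $h(n_1,n_2)$ is shifted so that it is
symmetric with respect to some point other than the origin, then it will have
linear phase.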
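The impulse response of the ideal low-pass filter can be tabulated from the series for $J_1(x)$. The sketch below truncates the series after a fixed number of terms, which is adequate only for moderate arguments, and uses the limit $\omega_c^2/(4\pi)$ at the origin (since $J_1(x) \approx x/2$ for small $x$); the function names and the cutoff value are assumptions.

```python
import math
import numpy as np

def bessel_j1(x, terms=30):
    """Truncated series J1(x) = sum_m (-1)^m / (m! (m+1)!) * (x/2)^(2m+1)."""
    x = np.asarray(x, dtype=float)
    return sum((-1) ** m / (math.factorial(m) * math.factorial(m + 1))
               * (x / 2.0) ** (2 * m + 1) for m in range(terms))

def ideal_lowpass(n1, n2, wc):
    """h(n1, n2) = wc / (2*pi*r) * J1(wc * r) with r = sqrt(n1^2 + n2^2)."""
    r = np.hypot(n1, n2)
    safe_r = np.where(r == 0, 1.0, r)               # avoid dividing by zero at the origin
    h = wc / (2.0 * np.pi * safe_r) * bessel_j1(wc * safe_r)
    return np.where(r == 0, wc ** 2 / (4.0 * np.pi), h)

wc = np.pi / 2
n1, n2 = np.meshgrid(np.arange(-5, 6), np.arange(-5, 6), indexing="ij")
h = ideal_lowpass(n1, n2, wc)
# Circular symmetry implies two-fold symmetry h(n1, n2) = h(-n1, -n2)
print(np.allclose(h, h[::-1, ::-1]))
```

In practice this infinite-extent response would have to be truncated or windowed to obtain an FIR filter.
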


1.3.4 IIR Filters and Partial Difference Equations 


The convolution summation (1.20) cannot be computationally evaluated if the 
impulse response has infinite extent. Technically, IIR filters cannot be implemented 
using the DFT method either, since the DFT cannot be defined for infinite-extent 
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sequences. However, in practice, if the filter-impulse response decays fast enough, it 
is possible to approximately implement IIR filters in the DFT domain using a suf- 
ficiently large DFT size with some tolerable spatial-domain aliasing. 

In general, in two or more dimensions, linear IIR systems are governed by partial
difference equations given by

$$g(n_1,n_2) = \sum_{i_1} \sum_{i_2} a_{i_1,i_2}\, g(n_1-i_1, n_2-i_2) + \sum_{i_1} \sum_{i_2} b_{i_1,i_2}\, s(n_1-i_1, n_2-i_2) \qquad (1.29)$$

where the output $g(n_1,n_2)$ can be expressed in terms of past outputs recursively.
Therefore, IIR filters are also called recursive filters. There are three key
concepts regarding recursive filters: i) recursive computability, ii) stability, and
iii) boundary conditions.


Recursive Computability 


In order to compute $g(n_1,n_2)$ given $s(n_1,n_2)$ from (1.29), we need to define a scanning
order to parse samples into a 1D order to label them as past, present, and
future. Lexicographic order scans all pixels in a line from left to right and then
advances to the leftmost sample of the next line and repeats the same process.
Then, all samples within a non-symmetric half-plane (NSHP) support about the
current pixel (all previous lines and samples to the left on the current line) are
considered “past” samples and (1.29) is recursively computable if the coefficient
array $\{a_{i_1,i_2}\}$ has NSHP support, meaning that the first summation contains only
“past” terms.


Stability 
Eqn. (1.29) will yield meaningful results if the 2D IIR filter is stable. Testing stability 
of 2D recursive filters is beyond the scope of this book, and the reader is referred to 


[Wds 06]. 


Boundary Conditions 


In order to compute the output samples $g(n_1,n_2)$, we need boundary conditions
around three borders of the output array as depicted in Figure 1.7. The width of
the boundary is related to the order of the filter, i.e., the support of the coefficient
array $\{a_{i_1,i_2}\}$.
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Figure 1.7 Boundary conditions for 2D recursive filtering. 


Impulse Response 


Given a difference equation (1.29), one can compute the impulse response of the
filter by setting the input $s(n_1,n_2) = \delta(n_1,n_2)$ and all boundary values equal to zero.
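The recursion can be sketched for a small filter as follows. The assumptions here are: $n_1$ indexes position within a line and $n_2$ indexes lines, only $b_{0,0}$ is nonzero, boundary values are zero, and the coefficients are passed as a dictionary keyed by the offsets $(i_1,i_2)$; the function name and coefficient values are hypothetical.

```python
import numpy as np

def recursive_filter(s, a, b00=1.0):
    """g(n1,n2) = sum a[(i1,i2)] g(n1-i1, n2-i2) + b00 s(n1,n2), lexicographic scan."""
    for i1, i2 in a:
        # NSHP support: previous lines (i2 > 0) or earlier samples on the same line
        if not (i2 > 0 or (i2 == 0 and i1 > 0)):
            raise ValueError("coefficient support is not recursively computable")
    N1, N2 = s.shape
    g = np.zeros((N1, N2))
    for n2 in range(N2):                   # line by line
        for n1 in range(N1):               # left to right within a line
            acc = b00 * s[n1, n2]
            for (i1, i2), coeff in a.items():
                p1, p2 = n1 - i1, n2 - i2
                if 0 <= p1 < N1 and 0 <= p2 < N2:   # zero boundary values elsewhere
                    acc += coeff * g[p1, p2]
            g[n1, n2] = acc
    return g

a = {(1, 0): 0.4, (0, 1): 0.4}             # illustrative coefficients
delta = np.zeros((6, 6)); delta[0, 0] = 1.0
h = recursive_filter(delta, a)             # impulse response on a 6 x 6 window
print(h[:3, :3])
```

For these coefficients the impulse response is $h(n_1,n_2) = \binom{n_1+n_2}{n_1} 0.4^{\,n_1+n_2}$ in the first quadrant, which decays with distance from the origin.
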


Example: 2D Auto-Regressive (AR) Model 


A 2D-AR model with NSHP support is a difference equation driven by white
noise $w(n_1,n_2)$, typically zero mean Gaussian with variance $\sigma_w^2$, where the
coefficient array $\{a_{i_1,i_2}\}$ has NSHP support and all $b_{i_1,i_2} = 0$ except for $b_{0,0} = 1$.
The 2D AR model is given by

$$s(n_1,n_2) = a_{1,1}\, s(n_1-1, n_2-1) + a_{0,1}\, s(n_1, n_2-1) + a_{-1,1}\, s(n_1+1, n_2-1) + a_{1,0}\, s(n_1-1, n_2) + w(n_1,n_2)$$

In this case, the width of the boundary is 1 pixel; i.e., the first column, the last
column, and the first line (row) of the output $s(n_1,n_2)$ must be known in order
to compute the rest of the output samples recursively, given the filter coefficients
and the input $w(n_1,n_2)$. As a rule of thumb, the filter is generally stable if

$$\sum_{i_1,i_2} |a_{i_1,i_2}| < 1$$

The 2D-AR model is often used to model the autocorrelation or power spectrum
of an image.
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1.4 Multi-Dimensional Sampling Theory 


A 2D-video signal can be obtained by sampling a spatio-temporal intensity function
$s_c(x_1,x_2,t)$ of continuous spatial and temporal coordinates in at least one of the
spatial variables and time. An analog-video signal is a 1D continuous function of time,
where one of the spatial dimensions is mapped onto time by means of a scanning
process. A digital-video representation samples $s_c(x_1,x_2,t)$ in all three dimensions,
where $(n_1,n_2,k)$ denote the discrete spatial and temporal variables. There are also
digital, time-varying 3D-video representations. For example, in volumetric 3D-video
representation, four dimensions $(x_1,x_2,x_3,t)$ (three spatial and one temporal) have
to be sampled. This section presents a general theory of MD sampling and some
commonly used sampling structures for analog and digital video. We introduce the 
theory of sampling MD signals on lattices including the frequency domain character- 
ization of sampled video signals. We present some examples, including sampling for 
analog video over 2D lattices and sampling for digital video over 3D and 4D lattices. 

In classical signal and image-processing texts [Cro 83, Opp 89], sampling of 
1D and 2D signals is often modeled by multiplication of the analog signal with 
an appropriate impulse train (a periodic sequence of Dirac delta functions), and 
the frequency domain analysis of sampling is introduced through convolution of 
the continuous Fourier transform of the analog signal with that of an appropri- 
ate impulse train (using the modulation property of the Fourier transform). While 
this framework can be easily extended to the case of MD sampling on rectangular 
grids (or other structures where the resulting impulse train is separable), it is not 
straightforward to study MD sampling on arbitrary periodic structures (e.g., verti- 
cally aligned 2:1 interlaced sampling) with this approach. Thus, we adopt the more 
general lattice framework [Dud 84, Dub 85] to study MD sampling, and present 
special cases to clarify the concept and notation. 


1.4.1 Sampling on a Lattice 


We begin with the definition of an MD lattice. Let $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M$ be linearly
independent vectors in the MD Euclidean space $R^M$. A lattice $\Lambda^M \subset R^M$ is the set of all
linear combinations of $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M$ with integer coefficients given by

$$\Lambda^M = \{ n_1\mathbf{v}_1 + n_2\mathbf{v}_2 + \cdots + n_M\mathbf{v}_M \mid n_1, n_2, \ldots, n_M \in Z \} \qquad (1.30)$$

The set of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M$ is called a basis for the lattice $\Lambda^M$, which defines
an MD arbitrary periodic sampling structure. An example of a 2D sampling lattice
is shown in Figure 1.8.
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Figure 1.8 2D sampling lattice with M = 2. 


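In vector-matrix notation, a lattice $\Lambda^M$ is the set of points defined by

$$\Lambda^M = \{ \mathbf{V}\mathbf{n} \mid \mathbf{n} \in Z^M \}$$

where $\mathbf{V}$ is called an $M \times M$ sampling matrix given by

$$\mathbf{V} = [\mathbf{v}_1 \mid \mathbf{v}_2 \mid \cdots \mid \mathbf{v}_M] \qquad (1.31)$$

and $\mathbf{n} = (n_1, n_2, \ldots, n_M)^T$. The basis, and thus the sampling matrix, for a given lattice
is not unique. In particular, for every sampling matrix $\mathbf{V}$, the product $\mathbf{V}\mathbf{E}$, where $\mathbf{E}$ is an
integer matrix with $\det\{\mathbf{E}\} = \pm 1$, forms another sampling matrix for $\Lambda^M$ (since the basis
vectors are the columns of $\mathbf{V}$, the integer change of basis multiplies $\mathbf{V}$ from the right). However, the
quantity $d(\Lambda^M) = |\det\{\mathbf{V}\}|$ is unique and denotes the reciprocal of the sampling
density.

Then, the sampled MD signal can be expressed as

$$s(\mathbf{n}) = s_c(\mathbf{V}\mathbf{n}),\ \mathbf{n} \in Z^M \qquad (1.32)$$
$$\phantom{s(\mathbf{n})} = s_c(\mathbf{x}),\ \mathbf{x} \in \Lambda^M$$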
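The non-uniqueness of the sampling matrix can be illustrated numerically. In this sketch the basis vectors are stored as the columns of `V`, a second basis is formed by post-multiplying with an integer matrix `E` of determinant 1, and the two bases are declared equivalent when each can be written as an integer combination of the other. The particular matrices are illustrative assumptions.

```python
import numpy as np

V = np.array([[2.0, 1.0],
              [0.0, 1.0]])        # columns are the basis vectors v1, v2
E = np.array([[1, 1],
              [0, 1]])            # integer matrix with det(E) = 1
V2 = V @ E                        # columns are integer combinations of v1, v2

# Change-of-basis matrices in both directions must be integer valued
T = np.linalg.solve(V, V2)        # V2 = V T, so T equals E
T_inv = np.linalg.solve(V2, V)    # V = V2 T_inv
same_lattice = bool(np.allclose(T, np.round(T)) and np.allclose(T_inv, np.round(T_inv)))

# |det V| (area per sample, the reciprocal of the sampling density) is unchanged
d1, d2 = abs(np.linalg.det(V)), abs(np.linalg.det(V2))
print(same_lattice, d1, d2)
```
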
The most suitable sampling structure for a time-varying image depends on its 


spatio-temporal frequency content. We present some examples in order to clarify the 
concept and the notation: 


1. 2D rectangular sampling: The $2 \times 2$ matrix V is diagonal and applies to both
rectangular sampling of still images in the horizontal $x_1$ and vertical $x_2$ directions
and progressive (non-interlaced) analog video that has been sampled in
the vertical $x_2$ and temporal $t$ directions. In the former case, the diagonal elements
of V are the horizontal and vertical sampling distances, $\Delta x_1$ and $\Delta x_2$,
respectively, and the sample locations are

$$x_1 = n_1\,\Delta x_1$$
$$x_2 = n_2\,\Delta x_2$$

The sampled signal can be expressed in the unitless coordinates $(n_1,n_2)$ as
$s(n_1,n_2) = s_c(n_1\Delta x_1, n_2\Delta x_2)$, $(n_1,n_2) \in Z^2$. In the latter case, an analog-video
signal is obtained by sampling the time-varying image intensity distribution in
the vertical $x_2$ and temporal $t$ directions by a process known as scanning, and
the sample locations are

$$x_2 = n_2\,\Delta x_2$$
$$t = k\,\Delta t$$

Continuous intensity information along each horizontal line is concatenated to
form a 1D analog video signal as a function of time. The 2D rectangular sampling
grid, which yields progressive analog video, and the associated sampling
matrix V are shown in Figure 1.9, where each dot indicates a continuous line of
video perpendicular to the plane of the page.


Figure 1.9 Orthogonal sampling structure for progressive analog video.

2. 2D sampling on arbitrary lattices: It applies to non-rectangular periodic sampling
of still images in the $x_1$ and $x_2$ directions, or 2:1 interlaced sampling of analog
video in the vertical $x_2$ and temporal $t$ directions. The sampling geometry can
be specified by two basis vectors $\mathbf{v}_1 = [v_{11}\ v_{21}]^T$ and $\mathbf{v}_2 = [v_{12}\ v_{22}]^T$ as:

$$x_1 = v_{11}n_1 + v_{12}n_2$$
$$x_2 = v_{21}n_1 + v_{22}n_2$$

In vector-matrix form, we have $\mathbf{x} = \mathbf{V}\mathbf{n}$, where $\mathbf{x} = [x_1\ x_2]^T$, $\mathbf{n} = [n_1\ n_2]^T$,
and $\mathbf{V} = [\mathbf{v}_1 \mid \mathbf{v}_2]$ is the sampling matrix. Then, the sampled signal can be
expressed as

$$s(\mathbf{n}) = s_c(\mathbf{V}\mathbf{n}),\ \mathbf{n} \in Z^2$$

The hexagonal sampling structure, which yields 2:1 interlaced analog video and
the associated sampling matrix V, is depicted in Figure 1.10.


3. 3D sampling on lattices: This case applies to sampling of a 3D volume in the $x_1$,
$x_2$, and $x_3$ directions, or sampling video signals in the $x_1$, $x_2$, and $t$ directions.

Digital video can be captured using a digital camera that records samples on
an explicit 3D structure or by sampling the analog-video signal in the horizontal
direction (along the scan lines), which results in an array of color/intensity
samples on an implicit 3D structure. The latter process is known as analog-to-
digital conversion.

Examples of 3D sampling lattices and their corresponding V matrices are
depicted in Figure 1.11 and Figure 1.12, where each circle indicates a pixel
location. The letter “p” inside the pixels indicates “progressive” sampling, where
all pixels are sampled at the same time, and letters “o” and “e” indicate odd and
even field pixels, respectively, which are sampled with a $\Delta t/2$ sec. time difference.
The reader is referred to [Dub 85] for further details on MD sampling and
sampling structures.


Figure 1.10 Hexagonal sampling lattice for 2:1 interlaced analog video, with sampling matrix

$$\mathbf{V} = \begin{bmatrix} 2\Delta x_2 & \Delta x_2 \\ 0 & \Delta t/2 \end{bmatrix}$$

Figure 1.11 Orthogonal sampling lattice for progressive digital video.

Figure 1.12 Vertically aligned 2:1 interlace lattice for interlaced digital video.

The sampling structures shown here are field or frame instantaneous; i.e., a
complete field or frame is acquired at one instant. An alternative strategy is
time-sequential sampling, where individual samples are taken one at a time according to
a prescribed ordering that is repeated after one complete frame. A complete analysis
of time-sequential sampling can be found in [Rah 92].


1.4.2 Spectrum of Signals Sampled on a Lattice 


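Let's first recall that the MD continuous Fourier transform (FT) $S_c(F_1, F_2, \ldots, F_M)$ of
an analog signal $s_c(x_1, x_2, \ldots, x_M)$ is given by

$$S_c(F_1, F_2, \ldots, F_M) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} s_c(x_1, x_2, \ldots, x_M)\, e^{-j2\pi(F_1x_1 + F_2x_2 + \cdots + F_Mx_M)}\, dx_1\, dx_2 \cdots dx_M$$

where $(F_1, F_2, \ldots, F_M) \in R^M$ and $(x_1, x_2, \ldots, x_M) \in R^M$. The inverse MD Fourier
transform is given by

$$s_c(x_1, x_2, \ldots, x_M) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} S_c(F_1, F_2, \ldots, F_M)\, e^{j2\pi(F_1x_1 + F_2x_2 + \cdots + F_Mx_M)}\, dF_1\, dF_2 \cdots dF_M$$

Here, the spatial-frequency variables $(F_1, F_2, \ldots, F_M)$ have the units cycles/mm
and are related to the radian frequencies by a factor of $2\pi$, i.e., $\Omega_i = 2\pi F_i$, $i = 1, \ldots, M$.
We can restate the MD Fourier transform relations in compact vector notation as

$$S_c(\mathbf{F}) = \int_{-\infty}^{\infty} s_c(\mathbf{x})\, e^{-j2\pi\mathbf{F}^T\mathbf{x}}\, d\mathbf{x} \qquad (1.33)$$

$$s_c(\mathbf{x}) = \int_{-\infty}^{\infty} S_c(\mathbf{F})\, e^{j2\pi\mathbf{F}^T\mathbf{x}}\, d\mathbf{F} \qquad (1.34)$$

where $\mathbf{x} = [x_1, x_2, \ldots, x_M]^T$ and $\mathbf{F} = [F_1, F_2, \ldots, F_M]^T$.
The Fourier transform of a discrete signal $s(\mathbf{n})$, sampled on a lattice $\Lambda^M$, is
defined, in vector notation, by

$$S(\mathbf{f}) = \sum_{\mathbf{n} \in Z^M} s(\mathbf{n})\, e^{-j2\pi\mathbf{f}^T\mathbf{n}} \qquad (1.35)$$

in terms of the unitless (normalized) frequency variables $f_1, f_2, \ldots, f_M$, where $\mathbf{f} =
[f_1, f_2, \ldots, f_M]^T$ and $\mathbf{n} = [n_1, n_2, \ldots, n_M]^T$. Note that $\omega_i = 2\pi f_i$, $i = 1, \ldots, M$. The inverse
Fourier transform can be expressed as

$$s(\mathbf{n}) = \int_{-1/2}^{1/2} S(\mathbf{f})\, e^{j2\pi\mathbf{f}^T\mathbf{n}}\, d\mathbf{f} \qquad (1.36)$$

Recall that the Fourier transform $S(\mathbf{f})$ of a discrete signal is periodic with the
fundamental period $|f_i| < 1/2$, $i = 1, \ldots, M$.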

In order to quantify the relationship between the Fourier transform $S(\mathbf{f})$ of a
signal sampled on a lattice and that of the underlying analog signal, $S_c(\mathbf{F})$, we next
define the reciprocal lattice and the unit cell of a lattice. Given a lattice $\Lambda^M$, the set
of all vectors $\mathbf{r}$ such that $\mathbf{r}^T\mathbf{x}$ is an integer for all $\mathbf{x} \in \Lambda^M$ is called the reciprocal lattice
$\Lambda^{M*}$ of $\Lambda^M$. A basis for $\Lambda^{M*}$ is the set of vectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_M$ determined by

$$\mathbf{u}_i^T\mathbf{v}_j = \delta_{ij}, \quad i, j = 1, 2, \ldots, M \qquad (1.37)$$

or, equivalently, by

$$\mathbf{U}^T\mathbf{V} = \mathbf{I}_M$$

where $\mathbf{I}_M$ is an $M \times M$ identity matrix. We will see that the Fourier transform of a
signal sampled on a lattice $\Lambda^M$ consists of a sum of periodic replications of the
spectrum of the analog signal on the reciprocal lattice $\Lambda^{M*}$.
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Figure 1.13 Voronoi cell of a 2D lattice. 


The definition of the unit cell of a lattice is not unique. Here, we define the Voronoi
cell of a lattice as a unit cell. The Voronoi cell, depicted in Figure 1.13, is the set
of all points that are closer to the origin than to any other sample point. It corresponds
to the fundamental period of the lattice.
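The reciprocal-lattice basis follows directly from $\mathbf{U}^T\mathbf{V} = \mathbf{I}_M$. The sketch below uses a 2:1 interlace-type lattice in the $(x_2, t)$ plane as an example; the numerical values of the line spacing and frame period are arbitrary assumptions chosen for illustration.

```python
import numpy as np

dx2, dt = 1.0, 1.0 / 30.0             # assumed line spacing and frame period
V = np.array([[2 * dx2, dx2],
              [0.0, dt / 2]])         # columns are the lattice basis vectors
U = np.linalg.inv(V).T                # reciprocal basis, so that U^T V = I

# Every reciprocal-lattice vector r = U k has an integer inner product
# with every lattice point x = V n, because r^T x = k^T n.
k = np.array([2, -1])
n = np.array([3, 5])
inner = (U @ k) @ (V @ n)
print(U)
print(inner, k @ n)
```
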


1.4.3 Nyquist Criterion for Sampling on a Lattice 


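In this section, we study frequency domain analysis of sampling on MD lattices. We
start by substituting Eqn. (1.34) into (1.32) to obtain

$$s(\mathbf{n}) = s_c(\mathbf{V}\mathbf{n}) = \int_{-\infty}^{\infty} S_c(\mathbf{F})\, e^{j2\pi\mathbf{F}^T\mathbf{V}\mathbf{n}}\, d\mathbf{F}$$

After the change of variables $\mathbf{f} = \mathbf{V}^T\mathbf{F}$, we have

$$s(\mathbf{n}) = \frac{1}{|\det(\mathbf{V})|} \int_{-\infty}^{\infty} S_c(\mathbf{U}\mathbf{f})\, e^{j2\pi\mathbf{f}^T\mathbf{n}}\, d\mathbf{f}$$

where $\mathbf{U} = (\mathbf{V}^T)^{-1}$ is the sampling matrix of the reciprocal lattice $\Lambda^{M*}$ and $d\mathbf{f} =
|\det(\mathbf{V})|\, d\mathbf{F}$.

Expressing the integration over the entire $\mathbf{f}$ plane as a sum of integrations over
the unit-cell squares $(-1/2, 1/2) \times (-1/2, 1/2)$, we obtain

$$s(\mathbf{n}) = \frac{1}{|\det(\mathbf{V})|} \sum_{\mathbf{k}} \int_{-1/2}^{1/2} S_c(\mathbf{U}(\mathbf{f}-\mathbf{k}))\, e^{j2\pi\mathbf{f}^T\mathbf{n}}\, e^{-j2\pi\mathbf{k}^T\mathbf{n}}\, d\mathbf{f}$$

where $e^{-j2\pi\mathbf{k}^T\mathbf{n}} = 1$ for $\mathbf{k}$, an integer-valued vector (by definition of the reciprocal
lattice). Finally, comparing this expression with (1.36), we conclude that

$$S(\mathbf{f}) = \frac{1}{|\det(\mathbf{V})|} \sum_{\mathbf{k}} S_c(\mathbf{U}(\mathbf{f}-\mathbf{k})) \qquad (1.38)$$

where $\mathbf{U}$ is the periodicity matrix in the frequency domain that satisfies $\mathbf{U}^T\mathbf{V} = \mathbf{I}_M$,
and $\mathbf{I}_M$ is the $M \times M$ identity matrix. The periodicity matrix can be expressed as
$\mathbf{U} = [\mathbf{u}_1 \mid \mathbf{u}_2 \mid \cdots \mid \mathbf{u}_M]$, where $\mathbf{u}_1, \ldots, \mathbf{u}_M$ are the basis vectors of the reciprocal lattice.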

The MD signal sampled on a lattice can alternatively be expressed in terms of
continuous variables as

$$s_p(\mathbf{x}) = s_c(\mathbf{x}) \sum_{\mathbf{n} \in Z^M} \delta(\mathbf{x} - V\mathbf{n}) = \sum_{\mathbf{n} \in Z^M} s(\mathbf{n})\, \delta(\mathbf{x} - V\mathbf{n}) \tag{1.39}$$

The Fourier transform $S_p(\mathbf{F})$ of the sampled signal $s_p(\mathbf{x})$ in terms of that of the
continuous signal $S_c(\mathbf{F})$ can be obtained from (1.38) by a change of variables as

$$S_p(\mathbf{F}) = \frac{1}{|\det(V)|} \sum_{\mathbf{k}} S_c(\mathbf{F} - U\mathbf{k}) \tag{1.40}$$

As expected, the Fourier transform of the sampled signal is the sum of an infinite
number of replications of the Fourier transform of the continuous signal, shifted
according to the reciprocal lattice $\Lambda^{M*}$. This is illustrated in Figure 1.14 for the case
of a circularly bandlimited signal.
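The replication pattern in (1.40) can be explored numerically. The sketch below lists the replica centers $U\mathbf{k}$ for a 2D sampling matrix and estimates the largest radius of a circular spectral support whose replicas do not overlap. The function names and the two example matrices are assumptions made for illustration, and the search over a small range of $\mathbf{k}$ is adequate only for reasonably well-conditioned sampling matrices.

```python
import itertools
import math

def reciprocal_basis(V):
    """Return U = (V^T)^{-1} for a 2x2 sampling matrix V."""
    (a, b), (c, d) = V
    det = a * d - b * c
    return [[d / det, -c / det], [-b / det, a / det]]

def replication_centers(V, K=1):
    """Centers U k of the spectral replicas for integer k in [-K, K]^2."""
    U = reciprocal_basis(V)
    return [(U[0][0] * k1 + U[0][1] * k2, U[1][0] * k1 + U[1][1] * k2)
            for k1, k2 in itertools.product(range(-K, K + 1), repeat=2)]

def max_alias_free_radius(V, K=3):
    """Half the distance from the origin to the nearest nonzero replica center.

    Only |k_i| <= K is searched; this is an assumption, not a general guarantee.
    """
    dists = [math.hypot(x, y) for x, y in replication_centers(V, K)
             if (x, y) != (0.0, 0.0)]
    return min(dists) / 2.0

rect = [[0.5, 0.0], [0.0, 1.0]]   # hypothetical rectangular grid: dx1 = 0.5, dx2 = 1.0
skew = [[2.0, 1.0], [0.0, 1.0]]   # hypothetical non-rectangular lattice
r_rect = max_alias_free_radius(rect)
r_skew = max_alias_free_radius(skew)
```

For the rectangular grid the result equals the smaller of $1/(2\Delta x_1)$ and $1/(2\Delta x_2)$, since the nearest replica lies along the more coarsely sampled axis.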


Special Case: Two-Dimensional Rectangular Sampling 


Rectangular sampling is a special case of lattice sampling where the matrices
$V$ and $U$ are diagonal. The classical approach of multiplication by an impulse
train and use of the modulation property of the Fourier transform would
actually suffice in this case (see Exercise 1.7). However, we take this special
case to demonstrate the lattice analysis in a step-by-step fashion. We first
substitute (1.34) into (1.32) with $M = 2$, and evaluate $x_1$ and $x_2$ at $x_1 = n_1 \Delta x_1$
and $x_2 = n_2 \Delta x_2$ to obtain

$$s(n_1, n_2) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} S_c(F_1, F_2)\, e^{j2\pi (F_1 n_1 \Delta x_1 + F_2 n_2 \Delta x_2)}\, dF_1\, dF_2$$

Figure 1.14 Sampling on an arbitrary 2D periodic grid: (a) spectral support of the continuous
image; (b) the sampling grid; (c) spectral support of the sampled image.


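After a change of variables $f_1 = F_1 \Delta x_1$ and $f_2 = F_2 \Delta x_2$, we have

$$s(n_1, n_2) = \frac{1}{\Delta x_1 \Delta x_2} \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} S_c\!\left(\frac{f_1}{\Delta x_1}, \frac{f_2}{\Delta x_2}\right) e^{j2\pi (f_1 n_1 + f_2 n_2)}\, df_1\, df_2$$

Next, we break the integration over the entire $(f_1, f_2)$ plane into a sum of
integrals over unit cells denoted by $SQ(k_1, k_2)$:

$$s(n_1, n_2) = \frac{1}{\Delta x_1 \Delta x_2} \sum_{k_1} \sum_{k_2} \iint_{SQ(k_1, k_2)} S_c\!\left(\frac{f_1}{\Delta x_1}, \frac{f_2}{\Delta x_2}\right) e^{j2\pi (f_1 n_1 + f_2 n_2)}\, df_1\, df_2$$

where $SQ(k_1, k_2)$ is defined as

$$-\frac{1}{2} + k_1 \le f_1 < \frac{1}{2} + k_1 \quad \text{and} \quad -\frac{1}{2} + k_2 \le f_2 < \frac{1}{2} + k_2$$

Another change of variables, $f_1 \to f_1 - k_1$ and $f_2 \to f_2 - k_2$ (with the summation
indices relabeled $k_i \to -k_i$, which leaves the sums unchanged), shifts all unit cells
$SQ(k_1, k_2)$ down to the fundamental period $(-1/2, 1/2) \times (-1/2, 1/2)$:

$$s(n_1, n_2) = \frac{1}{\Delta x_1 \Delta x_2} \int_{-1/2}^{1/2}\!\int_{-1/2}^{1/2} \sum_{k_1} \sum_{k_2} S_c\!\left(\frac{f_1 - k_1}{\Delta x_1}, \frac{f_2 - k_2}{\Delta x_2}\right) e^{j2\pi (f_1 n_1 + f_2 n_2)}\, df_1\, df_2 \tag{1.41}$$

since $e^{-j2\pi (k_1 n_1 + k_2 n_2)} = 1$ for $k_1$, $k_2$, $n_1$, $n_2$ integers. Now, rewriting (1.36) for
$M = 2$, we have

$$s(n_1, n_2) = \int_{-1/2}^{1/2}\!\int_{-1/2}^{1/2} S(f_1, f_2)\, e^{j2\pi (f_1 n_1 + f_2 n_2)}\, df_1\, df_2 \tag{1.42}$$

and comparing the expressions (1.41) and (1.42), we conclude that

$$S(f_1, f_2) = \frac{1}{\Delta x_1 \Delta x_2} \sum_{k_1} \sum_{k_2} S_c\!\left(\frac{f_1 - k_1}{\Delta x_1}, \frac{f_2 - k_2}{\Delta x_2}\right) \tag{1.43}$$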


Note that (1.43) is a special case of (1.38) where $M = 2$ and the $2 \times 2$ matrix
$V$ is diagonal. We see that, as a result of sampling, the spectrum of the continuous
signal replicates in the 2D frequency plane according to (1.43). The
case when the continuous signal is bandlimited with a circular spectral support
of radius $B < \min\{1/(2\Delta x_1), 1/(2\Delta x_2)\}$ is depicted in Figure 1.15.


If the image is not bandlimited or the sampling intervals $\Delta x_1$ and $\Delta x_2$ do not
satisfy the conditions of the Nyquist sampling theorem, then the replications
overlap with each other, which results in aliasing. This latter case is illustrated
in Figure 1.16.
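Aliasing can be seen directly on the samples. In the sketch below, a 2D cosine whose horizontal frequency exceeds the Nyquist limit produces exactly the same samples as a cosine at a lower (aliased) frequency. The frequencies, sampling intervals, and function names are hypothetical choices for illustration.

```python
import math

def sample_cosine(F1, F2, dx1, dx2, N=8):
    """Samples of cos(2*pi*(F1*x1 + F2*x2)) at x1 = n1*dx1, x2 = n2*dx2."""
    return [[math.cos(2.0 * math.pi * (F1 * n1 * dx1 + F2 * n2 * dx2))
             for n2 in range(N)] for n1 in range(N)]

def fold(f):
    """Map a normalized frequency (cycles/sample) to its alias in [-1/2, 1/2)."""
    return (f + 0.5) % 1.0 - 0.5

dx1 = dx2 = 1.0
F1, F2 = 0.7, 0.2   # F1 exceeds the Nyquist limit 1/(2*dx1) = 0.5
original = sample_cosine(F1, F2, dx1, dx2)
aliased = sample_cosine(fold(F1 * dx1) / dx1, fold(F2 * dx2) / dx2, dx1, dx2)

# Largest sample-by-sample difference between the two sampled patterns.
max_diff = max(abs(a - b) for row_a, row_b in zip(original, aliased)
               for a, b in zip(row_a, row_b))
```

After sampling, the two continuous patterns are indistinguishable, which is why the overlap of spectral replications cannot be undone.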


In all sampling problems, it is also possible to define the sampled signal in
terms of the continuous coordinate variables by using the 2D Dirac delta
signal as

$$\begin{aligned}
s_p(x_1, x_2) &= s_c(x_1, x_2) \sum_{n_1} \sum_{n_2} \delta(x_1 - n_1 \Delta x_1, x_2 - n_2 \Delta x_2) \\
&= \sum_{n_1} \sum_{n_2} s_c(n_1 \Delta x_1, n_2 \Delta x_2)\, \delta(x_1 - n_1 \Delta x_1, x_2 - n_2 \Delta x_2) \\
&= \sum_{n_1} \sum_{n_2} s(n_1, n_2)\, \delta(x_1 - n_1 \Delta x_1, x_2 - n_2 \Delta x_2)
\end{aligned} \tag{1.44}$$

Figure 1.15 Sampling on a 2D rectangular grid: (a) spectral support of the continuous image;
(b) the sampling grid; (c) spectral support of the sampled image.

Figure 1.16 Illustration of aliasing, when Nyquist sampling rate is violated.

Observe that $s_p(x_1, x_2)$ is indeed a sampled signal because of the presence of
the 2D Dirac delta function in its definition. Then, the relationship between
the Fourier transform $S_p(F_1, F_2)$ of $s_p(x_1, x_2)$ and that of the analog signal
can be obtained from Eqn. (1.43), by a change of variables $f_1 = F_1 \Delta x_1$ and
$f_2 = F_2 \Delta x_2$ as

$$S_p(F_1, F_2) = S(F_1 \Delta x_1, F_2 \Delta x_2) = \frac{1}{\Delta x_1 \Delta x_2} \sum_{k_1} \sum_{k_2} S_c\!\left(F_1 - \frac{k_1}{\Delta x_1}, F_2 - \frac{k_2}{\Delta x_2}\right) \tag{1.45}$$

Note that $S_p(F_1, F_2)$ is periodic with the fundamental period $|F_i| < \frac{1}{2\Delta x_i}$,
$i = 1, 2$.


1.4.4 Reconstruction from Samples on a Lattice 


Various digital-video systems have different spatio-temporal resolution requirements, 
which necessitate sampling structure conversion. The sampling structure conversion 
problem, which is treated in Section 1.5, can alternatively be posed as reconstruc- 
tion of the underlying continuous spatio-temporal video, followed by its resampling 
on the desired spatio-temporal lattice. Thus, we briefly discuss reconstruction of a 
continuous video signal from its samples. 

The reconstructed time-varying image $s_r(\mathbf{x}, t)$ can be obtained through the ideal
low-pass filtering operation applied to the Fourier transform $S_p(\mathbf{F})$ of the sampled signal:

$$S_r(\mathbf{F}) = \begin{cases} |\det(V)|\, S_p(\mathbf{F}) & \text{for } \mathbf{F} \in P \\ 0 & \text{otherwise} \end{cases} \tag{1.46}$$

Here, the passband of the ideal low-pass filter is determined by the unit cell $P$ of
the reciprocal sampling lattice, which is depicted by the dotted lines in Figure 1.17.
Taking the inverse Fourier transform, we have the reconstructed time-varying
image:

$$s_r(\mathbf{x}, t) = \sum_{\mathbf{n} \in Z^2} \sum_{k \in Z} s(\mathbf{n}, k)\, h\!\left(\begin{bmatrix} \mathbf{x} \\ t \end{bmatrix} - V \begin{bmatrix} \mathbf{n} \\ k \end{bmatrix}\right) \tag{1.47}$$

where

$$h(\mathbf{x}, t) = |\det(V)| \int_P e^{j2\pi \mathbf{F}^T \left[\begin{smallmatrix} \mathbf{x} \\ t \end{smallmatrix}\right]}\, d\mathbf{F} \tag{1.48}$$

is the impulse response of the ideal bandlimited spatio-temporal interpolation fil-
ter for the sampling structure used. Unlike the case of rectangular sampling, this
integral, in general, cannot be reduced to a simple closed-form expression. As
expected, exact reconstruction of a continuous signal from its samples on a lattice
$\Lambda^3$ is possible if the signal spectrum is confined to a unit cell $P$ of the reciprocal
lattice.

Figure 1.17 Passband of the ideal reconstruction filter.


1.5 Sampling Structure Conversion 


Various digital-video systems, ranging from ultra high-definition TV to mobile video, 
have different spatio-temporal resolution requirements leading to the emergence of 
different format standards. The task of converting digital video from one format to 
another is referred to as standards conversion. Effective standards conversion enables 
exchange of information among various digital-video systems, employing different 
format standards, to ensure their interoperability. 

Standards conversion is a sampling structure conversion problem, i.e., a spatio- 
temporal interpolation/decimation problem. In theory, sampling structure con- 
version can be treated in two steps: reconstruction of the underlying continuous 
spatio-temporal signal, followed by its resampling on the desired MD sampling 
structure. Here, we introduce an all-digital formulation, where general sampling 
structure conversion from an MD input lattice to an MD output lattice is posed 
as an MD digital signal-processing problem with or without taking advantage of 
the temporal redundancy in the video. This section only introduces the general
problem formulation without going into the specifics of the filters involved. Some
practical image decimation and interpolation filters are discussed in Chapter 3.
Video-filtering methods, including intra-frame/field and inter-frame/field meth-
ods, which implicitly or explicitly use interframe motion information, are pre-
sented in Chapter 6.

Figure 1.18 Decomposition of the system for sampling structure conversion.

In order to define the general sampling structure conversion problem, we need
to define the sum of two lattices, shown in Figure 1.18, as well as the intersection of
two lattices. We define the sum of two lattices as

$$\Lambda_1^M + \Lambda_2^M = \{\mathbf{x}_1 + \mathbf{x}_2 \mid \mathbf{x}_1 \in \Lambda_1^M \text{ and } \mathbf{x}_2 \in \Lambda_2^M\} \tag{1.49}$$

Thus, the sum of two lattices can be found by adding each point in one lattice to
every point of the other lattice. We can also define the intersection of two lattices as

$$\Lambda_1^M \cap \Lambda_2^M = \{\mathbf{x} \mid \mathbf{x} \in \Lambda_1^M \text{ and } \mathbf{x} \in \Lambda_2^M\} \tag{1.50}$$

The intersection $\Lambda_1^M \cap \Lambda_2^M$ is the largest lattice that is a sublattice of both $\Lambda_1^M$ and
$\Lambda_2^M$, and the sum $\Lambda_1^M + \Lambda_2^M$ is the smallest lattice that contains both $\Lambda_1^M$ and $\Lambda_2^M$
as sublattices.
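For lattices with integer sampling matrices (coordinates expressed in units of the grid spacing), the densities of the sum and intersection lattices can be computed without listing points. The sketch below relies on two standard facts about integer lattices: the determinant of the lattice generated by the columns of $[V_1 \,|\, V_2]$ is the gcd of its $2 \times 2$ minors, and $\det(\Lambda_1 \cap \Lambda_2) \det(\Lambda_1 + \Lambda_2) = \det(\Lambda_1) \det(\Lambda_2)$. The matrices and function names are hypothetical.

```python
from itertools import combinations
from math import gcd

def columns(V):
    """Basis vectors (columns) of a 2x2 integer sampling matrix."""
    return [(V[0][0], V[1][0]), (V[0][1], V[1][1])]

def det2(u, v):
    return u[0] * v[1] - u[1] * v[0]

def sum_lattice_det(V1, V2):
    """|det| of a basis of Lambda1 + Lambda2: gcd of all 2x2 minors of [V1 | V2]."""
    g = 0
    for u, v in combinations(columns(V1) + columns(V2), 2):
        g = gcd(g, abs(det2(u, v)))
    return g

def intersection_det(V1, V2):
    """|det| of a basis of the intersection lattice."""
    d1 = abs(det2(*columns(V1)))
    d2 = abs(det2(*columns(V2)))
    return d1 * d2 // sum_lattice_det(V1, V2)

def in_lattice(V, p):
    """True if the integer point p = (x, y) equals V n for an integer vector n."""
    (a, b), (c, d) = V
    det = a * d - b * c
    x, y = p
    return (d * x - b * y) % det == 0 and (a * y - c * x) % det == 0

# Hypothetical integer sampling matrices with |det| = 2 and |det| = 4.
V1 = [[2, 1], [0, 1]]
V2 = [[1, 0], [2, 4]]
common = [(x, y) for x in range(8) for y in range(8)
          if in_lattice(V1, (x, y)) and in_lattice(V2, (x, y))]
```

The brute-force count of common points in an 8 x 8 window cross-checks the determinant formula: 64 grid points divided by the number of common points gives the determinant of the intersection lattice.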


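The up-conversion from $\Lambda_1^M$ to $\Lambda_1^M + \Lambda_2^M$ is defined as follows:

$$u_p(\mathbf{x}) = \mathcal{U} s_p(\mathbf{x}) = \begin{cases} s_p(\mathbf{x}) & \mathbf{x} \in \Lambda_1^M \\ 0 & \mathbf{x} \notin \Lambda_1^M \text{ and } \mathbf{x} \in \Lambda_1^M + \Lambda_2^M \end{cases} \tag{1.51}$$

and down-conversion from $\Lambda_1^M + \Lambda_2^M$ to $\Lambda_2^M$ as

$$y_p(\mathbf{x}) = \mathcal{D} w_p(\mathbf{x}) = w_p(\mathbf{x}), \quad \mathbf{x} \in \Lambda_2^M \tag{1.52}$$

The low-pass filter is applied on the lattice $\Lambda_1^M + \Lambda_2^M$, which has a higher sam-
pling density than both $\Lambda_1^M$ and $\Lambda_2^M$. By definition, this filter will be shift-invariant
if the output gets shifted by a vector $\mathbf{p}$ when the input is shifted by $\mathbf{p}$. Thus, we need
$\mathbf{p} \in \Lambda_1^M \cap \Lambda_2^M$. This condition is satisfied if $\Lambda_1^M \cap \Lambda_2^M$ is a lattice; i.e., $V_1^{-1} V_2$ is a
matrix of rational numbers. This requirement is the counterpart of L/M needing to
be a rational number in the case of 1D sampling rate change problems.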

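The linear shift-invariant filtering operation on $\Lambda_1^M + \Lambda_2^M$ can be expressed as

$$w_p(\mathbf{x}) = \sum_{\mathbf{z} \in \Lambda_1^M + \Lambda_2^M} u_p(\mathbf{z})\, h_p(\mathbf{x} - \mathbf{z}), \quad \mathbf{x} \in \Lambda_1^M + \Lambda_2^M \tag{1.53}$$

However, by the definition of the up-sampling operator, $u_p(\mathbf{x}) = s_p(\mathbf{x})$ for
$\mathbf{x} \in \Lambda_1^M$ and zero otherwise; thus,

$$w_p(\mathbf{x}) = \sum_{\mathbf{z} \in \Lambda_1^M} s_p(\mathbf{z})\, h_p(\mathbf{x} - \mathbf{z}), \quad \mathbf{x} \in \Lambda_1^M + \Lambda_2^M$$

After the down-sampling operation,

$$y_p(\mathbf{x}) = \sum_{\mathbf{z} \in \Lambda_1^M} s_p(\mathbf{z})\, h_p(\mathbf{x} - \mathbf{z}), \quad \mathbf{x} \in \Lambda_2^M \tag{1.54}$$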
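The three-stage structure of (1.51) through (1.54) is easiest to follow in 1D. The sketch below converts samples from the lattice $2Z$ to the lattice $3Z$ through the sum lattice $Z$: zero filling, filtering, and then discarding samples. The linear-interpolation kernel, the input data, and the function names are assumptions made for illustration; they are not the ideal filter discussed in the text.

```python
def up_convert(s, step1, length):
    """Zero-fill samples given on step1*Z onto the sum lattice Z, as in (1.51)."""
    return [float(s[x // step1]) if x % step1 == 0 else 0.0 for x in range(length)]

def lsi_filter(u, taps):
    """Filter on the sum lattice, as in (1.53); taps maps offset -> h(offset)."""
    w = []
    for x in range(len(u)):
        acc = 0.0
        for offset, h in taps.items():
            z = x - offset
            if 0 <= z < len(u):
                acc += u[z] * h
        w.append(acc)
    return w

def down_convert(w, step2):
    """Keep only the samples on step2*Z, as in (1.52)."""
    return w[::step2]

# Hypothetical 1D example: input lattice 2Z, output lattice 3Z, sum lattice Z.
s = [0, 2, 4, 6, 8, 10]              # s[i] is the sample at position x = 2*i
taps = {-1: 0.5, 0: 1.0, 1: 0.5}     # linear-interpolation kernel (an assumption)
length = 2 * (len(s) - 1) + 1
y = down_convert(lsi_filter(up_convert(s, 2, length), taps), 3)

# Direct evaluation of (1.54): the sum runs over input-lattice points only.
y_direct = [sum(s[i] * taps.get(3 * j - 2 * i, 0.0) for i in range(len(s)))
            for j in range(len(y))]
```

Because the input here is a linear ramp, linear interpolation reproduces it exactly at the new sample positions 0, 3, 6, and 9, and the direct form (1.54) gives the same values without ever forming the zero-filled signal.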


The frequency response of the filter is periodic with the main period determined
by the unit cell of $(\Lambda_1^M + \Lambda_2^M)^*$. In order to avoid aliasing, the passband of the
interpolation/anti-alias (low-pass) filter is restricted to the smaller of the Voronoi
cells of $(\Lambda_1^M)^*$ and $(\Lambda_2^M)^*$. Sampling lattice conversion is illustrated by the follow-
ing examples.


Example: Conversion from $\Lambda_1^2$ to $\Lambda_2^2$ [Dub 85]


We consider a 2D sampling lattice conversion from the input lattice $\Lambda_1^2$
to the output lattice $\Lambda_2^2$, which is shown in Figure 1.19 along with the
sampling matrices, $V_1$ and $V_2$. The sampling densities of $\Lambda_1^2$ and $\Lambda_2^2$ are
inversely proportional to the determinants of $V_1$ and $V_2$, given by

$$\det(V_1) = 2\Delta x_1 \Delta x_2 \quad \text{and} \quad \det(V_2) = 4\Delta x_1 \Delta x_2$$

Hence, the sampling density of the output lattice $\Lambda_2^2$ is one half of that of
the input lattice $\Lambda_1^2$. Also shown in Figure 1.19 are the lattices $\Lambda_1^2 + \Lambda_2^2$ and
$\Lambda_1^2 \cap \Lambda_2^2$, along with their sampling matrices, which can be used
to define the sampling conversion factor

$$Q = (\Lambda_1^2 + \Lambda_2^2 : \Lambda_1^2) = (\Lambda_2^2 : \Lambda_1^2 \cap \Lambda_2^2) = 2$$

Note that $\Lambda_1^2 + \Lambda_2^2$ is obtained by adding all $\mathbf{x}_1 \in \Lambda_1^2$ to $\mathbf{x}_2 \in \Lambda_2^2$, and
$\Lambda_1^2 \cap \Lambda_2^2$ contains all points that are elements of both lattices.

Figure 1.19 Lattices $\Lambda_1^2$, $\Lambda_2^2$, $\Lambda_1^2 + \Lambda_2^2$, and $\Lambda_1^2 \cap \Lambda_2^2$ [Dub 85]. (© 1985 IEEE)

Because we have a down-conversion problem, by a factor of 2, anti-alias fil-
tering must be performed on the lattice $\Lambda_1^2 + \Lambda_2^2$. The fundamental period
of the filter-frequency response is given by the unit cell of $(\Lambda_1^2 + \Lambda_2^2)^*$,
which is indicated by the dotted lines in Figure 1.20. Note that the unit cell
of $(\Lambda_1^2 + \Lambda_2^2)^*$ is determined by the matrix $(V^T)^{-1}$, where $V$ is the sampling
matrix of $\Lambda_1^2 + \Lambda_2^2$. In order to avoid
aliasing, the passband of the low-pass filter performed on the lattice $\Lambda_1^2 + \Lambda_2^2$
must be restricted to the Voronoi cell of $(\Lambda_2^2)^*$, which is determined by the
matrix $U_2 = (V_2^T)^{-1}$ and has a hexagonal shape.

Figure 1.20 Spectral support of s(x) with the periodicity matrix $U_1$, and the support of the
frequency response of the filter [Dub 85]. (© 1985 IEEE)


Example: De-interlacing 


De-interlacing refers to conversion from an interlaced sampling grid (input)
to a progressive grid (output), as shown in Figure 1.21(a) and (b), respec-
tively. The sampling matrices for the input and output grids are

$$V_{in} = \begin{bmatrix} \Delta x_1 & 0 & 0 \\ 0 & 2\Delta x_2 & \Delta x_2 \\ 0 & 0 & \Delta t \end{bmatrix} \quad \text{and} \quad V_{out} = \begin{bmatrix} \Delta x_1 & 0 & 0 \\ 0 & \Delta x_2 & 0 \\ 0 & 0 & \Delta t \end{bmatrix}$$

respectively. Note that $|\det V_{in}| = 2\,|\det V_{out}|$; hence, we have a spatio-
temporal interpolation problem by a factor of 2. The interpolation can be
achieved by zero filling followed by ideal low-pass filtering. The passband of
the ideal low-pass filter should be restricted to the unit cell of the reciprocal
lattice of the output sampling lattice, which is given by

$$\left(-\frac{1}{2\Delta x_1}, \frac{1}{2\Delta x_1}\right) \times \left(-\frac{1}{2\Delta x_2}, \frac{1}{2\Delta x_2}\right) \times \left(-\frac{1}{2\Delta t}, \frac{1}{2\Delta t}\right)$$

Figure 1.21 Interlaced to progressive conversion: (a) input; (b) output lattices.
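To make the zero-filling step concrete, the sketch below de-interlaces a single field by inserting zero lines and then applying a short vertical filter with taps (1/2, 1, 1/2). This intra-field line-averaging filter is a deliberately simple stand-in, not the ideal spatio-temporal low-pass filter described above; the function name and test data are hypothetical.

```python
def deinterlace_field(field_rows, n_rows, parity=0):
    """Zero-fill one field into a full frame, then filter vertically.

    field_rows holds the lines of one field; they land on frame rows
    parity, parity + 2, ... Existing lines pass through unchanged and each
    missing interior line becomes the average of its vertical neighbors.
    A missing line on the frame border has only one neighbor and would
    receive half its value, so the example keeps border lines present.
    """
    width = len(field_rows[0])
    frame = [[0.0] * width for _ in range(n_rows)]
    for i, row in enumerate(field_rows):
        frame[2 * i + parity] = [float(v) for v in row]
    taps = {-1: 0.5, 0: 1.0, 1: 0.5}
    out = []
    for r in range(n_rows):
        out_row = []
        for c in range(width):
            acc = 0.0
            for k, t in taps.items():
                if 0 <= r + k < n_rows:
                    acc += t * frame[r + k][c]
            out_row.append(acc)
        out.append(out_row)
    return out

even_field = [[10, 20], [30, 40], [50, 60]]   # lines 0, 2, 4 of a 5-line frame
progressive = deinterlace_field(even_field, n_rows=5, parity=0)
```

A filter of this kind uses no temporal information, so it cannot recover vertical detail that is present only in the other field.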


In spatio-temporal sampling structure down-conversion without motion compen- 
sation, there is a tradeoff between allowed aliasing errors and loss of resolution (blur- 
ring) due to anti-alias (low-pass) filtering prior to down-conversion. When anti-alias 
filtering has been used prior to down-conversion, the resolution that is lost cannot 
be recovered by subsequent interpolation. We present motion-compensated filtering
methods to incorporate interframe motion information into sampling structure conver-
sion in Chapter 6. Motion-compensated interpolation makes it possible to recover
full-resolution frames by up-conversion of previously down-converted frames if no
anti-alias filtering has been applied in the down-conversion process.

Exercises


Problem Set 1

1.1 The signal

$$s(n_1, n_2) = \delta(n_1 - n_2) = \begin{cases} 1 & \text{if } n_1 = n_2 \\ 0 & \text{otherwise} \end{cases}$$

where $\delta(n)$ is a 1D impulse, is called a line impulse. Draw this signal and find
the angle it makes with the horizontal axis. Can you write other line impulse
signals with other angular orientations without a gap in the line?


1.2 Draw the signal 
5(111, 7) = typln 7) u(n 一 7) 
where ui(n) is a 1D unit step. Is this signal separable? Why or why not? 
1.3 Let a periodic signal satisfy

$$s(n_1, n_2) = s(n_1 + 3, n_2 + 5) = s(n_1 - 2, n_2 - 3)$$

Find a periodicity matrix N. Is it rectangularly periodic? Why or why not?
1.4 Find a periodicity matrix for 


2am, | 27n, | 


5 ,1,) = sin 
(m,n, | 16
and 





2 
s (nmm) = sn| a 


Is $s_1(n_1, n_2) = s_2(n_1, n_2)$? Explain.

1.5 Convolve


1 Veg SN mL oS EN 
0 otherwise 


san) =| 


with 


YY. -1<=n <1, -1=n,= 
Henn) =| 74 ii ae ian 
0 


otherwise 


Find the frequency response of this filter. 


1.6 Suppose we wish to convolve an $N_1 \times N_2$ image with a $K_1 \times K_2$ impulse
response. We have the option of implementing this as a spatial-domain con-
volution summation or by multiplication in the Fourier domain. For what
values of $N = N_1 + N_2$ and $K = K_1 + K_2$ will the spatial-domain convolu-
tion be faster than going through the Fourier domain? Explain.


1.7 In the case of rectangular sampling, the classical approach of multiplication
of the analog signal $s_c(x_1, x_2)$ by a 2D rectangular impulse train,

$$s_p(x_1, x_2) = s_c(x_1, x_2) \sum_{n_1} \sum_{n_2} \delta(x_1 - n_1 \Delta x_1, x_2 - n_2 \Delta x_2)$$

and then use of the modulation property of the 2D continuous Fourier
transform would suffice to derive (1.43). Derive (1.43) using the approach
outlined above.


1.8 The expression (1.43) assumes impulse sampling; i.e., the sampling aperture
has no physical size. A practical camera has a finite pixel aperture modeled by
the impulse response $h_a(x_1, x_2)$. How would you incorporate the effect of the
finite aperture size into (1.43)?


1.9 Suppose a camera samples with 20-micron intervals in both the horizontal
and vertical directions. What is the highest spatial frequency in the sampled
image that can be represented with less than 3 dB attenuation if a) $h_a(x_1, x_2)$ is
a 2D Dirac delta function and b) $h_a(x_1, x_2)$ is a uniform circle with diameter
20 microns?


1.10 Strictly speaking, images with sharp spatial edges are not bandlimited. Discuss
how you would digitize an image that is not bandlimited.


1.11 Evaluate the impulse response (1.48) if P is the unit sphere.


1.12 Find the locations of the spectral replications for the 3D sampling lattices
depicted in Figure 1.11 and Figure 1.12.


1.13 Suppose an image is sampled on a sampling lattice defined by
Ax. 
v= S |, 
O Ae 


a. Compute and show where the replications occur in the spatial-frequency 
domain. 

b. Show the support of the frequency response of the ideal conversion filter 
if we wish to convert this sampling structure to a lattice defined by 


v,= “ho | 


0 2Ax, 


MATLAB Exercises 


1.1 Fourier Magnitude and Phase Relations: Take two gray-level images, $x[n_1, n_2]$
and $y[n_1, n_2]$. Compute the magnitude and phase functions $|X[k_1, k_2]|$,
$\phi_x[k_1, k_2]$, $|Y[k_1, k_2]|$, and $\phi_y[k_1, k_2]$ of their 2D-DFT, respectively, where
$X[k_1, k_2] = |X[k_1, k_2]|\, e^{j\phi_x[k_1, k_2]}$ and $Y[k_1, k_2] = |Y[k_1, k_2]|\, e^{j\phi_y[k_1, k_2]}$.

a. Define two new Fourier transforms $W[k_1, k_2] = 1 \cdot e^{j\phi_x[k_1, k_2]}$ and
$Z[k_1, k_2] = |X[k_1, k_2]|\, e^{j0}$. Compute and display the images $w[n_1, n_2]$ and $z[n_1, n_2]$.
Print both images. Which one looks more similar to $x[n_1, n_2]$?

b. Now define $A[k_1, k_2] = |Y[k_1, k_2]|\, e^{j\phi_x[k_1, k_2]}$ and $B[k_1, k_2] = |X[k_1, k_2]|\, e^{j\phi_y[k_1, k_2]}$.
Compute and display the images $a[n_1, n_2]$ and $b[n_1, n_2]$. Print both images.
What do they look like? Explain what you see.


1.2 Frequency Response of Filters, DFT: Given a 1D box filter with a length of nine
samples, specified by h = (1/9)*ones(9,1)
a. Generate a separable 2D filter impulse response from the above 1D filter.
Plot the frequency response of the 2D filter. What kind of filter is it?
b. Generate a circularly symmetric 2D filter impulse response. Plot the
frequency response of the 2D filter.


1.3 Spatial-Frequency Patterns: Generate the horizontal spatial-frequency pattern

$$s(n_1, n_2) = 127 \cos(2\pi k_1 n_1 / 512) + 128, \quad 0 \le n_1 \le 511, \; 0 \le n_2 \le 511$$

as an image. Display this image for different values of $k_1$.


CHAPTER 2 


Digital Images and Video 





Advances in ultra-high-definition and 3D-video technologies as well as high-speed 
Internet and mobile computing have led to the introduction of new video services. 


Digital images and video refer to 2D or 3D still and moving (time-varying) visual 
information, respectively. A still image is a 2D/3D spatial distribution of intensity 
that is constant with respect to time. A video is a 3D/4D spatio-temporal inten- 
sity pattern, i.e., a spatial-intensity pattern that varies with time. Another term 
commonly used for video is image sequence, since a video is represented by a time 
sequence of still images (pictures). The spatio-temporal intensity pattern of this time 
sequence of images is ordered into a 1D analog or digital video signal as a function 
of time only according to a progressive or interlaced scanning convention. 

We begin with a short introduction to human visual perception and color models 
in Section 2.1. Next, we present 2D digital video representations and a brief sum-
mary of current standards in Section 2.2. We introduce 3D digital video display,
representations, and standards in Section 2.3. Section 2.4 provides an overview of 
popular digital video applications, including digital TV, digital cinema, and video 
streaming. Finally, Section 2.5 discusses factors affecting video quality and quanti-
tative and subjective video-quality assessment.

2.1 Human Visual System and Color


Video is mainly consumed by the human eye. Hence, many imaging system design 
choices and parameters, including spatial and temporal resolution as well as color 
representation, have been inspired by or selected to imitate the properties of human 
vision. Furthermore, digital image/video-processing operations, including filtering 
and compression, are generally designed and optimized according to the specifica- 
tions of the human eye. In most cases, details that cannot be perceived by the human 
eye are regarded as irrelevant and referred to as perceptual redundancy. 


2.1.1 Color Vision and Models 


The human eye is sensitive to the range of wavelengths between 380 nm (blue end 
of the visible spectrum) and 780 nm (red end of the visible spectrum). The cornea, 
iris, and lens comprise an optical system that forms images on the retinal surface. 
There are about 100-120 million rods and 7-8 million cones in the retina [Wan 
95, Fer 01]. They are receptor nerve cells that emit electrical signals when light hits 
them. The region of the retina with the highest density of photoreceptors is called 
the fovea. Rods are sensitive to low-light (scotopic) levels but only sense the intensity 
of the light; they enable night vision. Cones enable color perception and are best in 
bright (photopic) light. They have bandpass spectral response. There are three types 
of cones that are more sensitive to short (S), medium (M), and long (L) wavelengths, 
respectively. The spectral response of S-cones peaks at 420 nm, M-cones at 534 nm,
and L-cones at 564 nm, with significant overlap in their spectral response ranges and
varying degrees of sensitivity over this range of wavelengths, specified by the function
$m_k(\lambda)$, $k = r, g, b$, as depicted in Figure 2.1(a).

The perceived color of light $f(x_1, x_2, \lambda)$ at spatial location $(x_1, x_2)$ depends on the
distribution of energy in the wavelength $\lambda$ dimension. Hence, color sensation can
be achieved by sampling $\lambda$ into three levels to emulate color sensation of each type
of cones as:

$$f_k(x_1, x_2) = \int f(x_1, x_2, \lambda)\, m_k(\lambda)\, d\lambda, \quad k = r, g, b \tag{2.1}$$

where $m_k(\lambda)$ is the wavelength sensitivity function (also known as the color-
matching function) of the kth cone type or color sensor. This implies that perceived
color at any location $(x_1, x_2)$ depends only on three values $f_r$, $f_g$, and $f_b$, which are
called the tristimulus values.
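Equation (2.1) is a weighted integral over wavelength, which can be approximated by a sum. The sketch below does this for one pixel, using Gaussian curves as placeholder sensitivity functions. These Gaussians, their centers and widths, and the test spectrum are assumptions made for illustration; they are not the CIE color-matching functions.

```python
import math

def gaussian(lam, center, width):
    return math.exp(-0.5 * ((lam - center) / width) ** 2)

# Placeholder sensitivities m_k(lambda): Gaussian stand-ins, NOT the CIE curves.
SENSITIVITY = {"r": (600.0, 40.0), "g": (550.0, 40.0), "b": (450.0, 30.0)}

def tristimulus(spectrum, lam_min=380, lam_max=780, step=1):
    """Riemann-sum approximation of (2.1) for a spectrum f(lambda) at one pixel."""
    values = {}
    for k, (center, width) in SENSITIVITY.items():
        total = 0.0
        for lam in range(lam_min, lam_max + 1, step):
            total += spectrum(lam) * gaussian(lam, center, width) * step
        values[k] = total
    return values

# Hypothetical narrowband light centered at 450 nm.
bluish = tristimulus(lambda lam: gaussian(lam, 450.0, 10.0))
```

Two physically different spectra that produce the same three sums would be indistinguishable to such a sensor, which is the sense in which color is reduced to three values.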

It is also known that the human eye has a secondary processing stage whereby
the R, G, and B values sensed by the cones are converted into a luminance and two
color-difference (chrominance) values [Fer 01]. The luminance Y is related to the
perceived brightness of the light and is given by

$$Y(x_1, x_2) = \int f(x_1, x_2, \lambda)\, l(\lambda)\, d\lambda \tag{2.2}$$

where $l(\lambda)$ is the International Commission on Illumination (CIE) luminous effi-
ciency function, depicted in Figure 2.1(b), which shows the contribution of energy
at each wavelength to a standard human observer's perception of brightness. Two
chrominance values describe the perceived color of the light. Color representations
for color image processing are further discussed in Section 2.2.3.

Figure 2.1 Spectral sensitivity: (a) CIE 1931 color-matching functions for a standard observer with
a 2-degree field of view, where the curves $\bar{x}$, $\bar{y}$, and $\bar{z}$ may represent $m_r(\lambda)$, $m_g(\lambda)$, and $m_b(\lambda)$,
respectively, and (b) the CIE luminous efficiency function $l(\lambda)$ as a function of wavelength $\lambda$.

Now that we have established that the human eye perceives color in terms of
three component values, the next question is whether all colors can be reproduced 
by mixing three primary colors. The answer to this question is yes in the sense that 
most colors can be realized by mixing three properly chosen primary colors. Hence, 
inspired by human color perception, digital representation of color is based on the 
tri-stimulus theory, which states that all colors can be approximated by mixing 
three additive primaries, which are described by their color-matching functions. As 
a result, colors are represented by triplets of numbers, which describe the weights 
used in mixing the three primaries. All colors that can be reproduced by a com- 
bination of three primary colors define the color gamut of a specific device. There 
are different choices for selecting primaries based on additive and subtractive color 
models. We discuss the additive RGB and subtractive CMYK color spaces and color 
management in the following. However, an in-depth discussion of color science is 
beyond the scope of this book, and interested readers are referred to [Tru 93, Sha 98, 
Dub 10]. 


RGB and CMYK Color Spaces 


The RGB model, inspired by human vision, is an additive color model in which red, 
green, and blue light are added together to reproduce a variety of colors. The RGB 
model applies to devices that capture and emit color light such as digital cameras, 
video projectors, LCD/LED TV and computer monitors, and mobile phone dis- 
plays. Alternatively, devices that produce materials that reflect light, such as color 
printers, are governed by the subtractive CMYK (Cyan, Magenta, Yellow, Black) 
color model. Additive and subtractive color spaces are depicted in Figure 2.2. RGB 
and CMYK are device-dependent color models: i.e., different devices detect or repro- 
duce a given RGB value differently, since the response of color elements (such as 
filters or dyes) to individual R, G, and B levels may vary among different manufac- 
turers. Therefore, the RGB color model itself does not define absolute red, green, and 
blue (hence, the result of mixing them) colorimetrically. 

When the exact chromaticities of red, green, and blue primaries are defined, 
we have a color space. There are several color spaces, such as CIERGB, CIEXYZ, or 
sRGB. CIERGB and CIEXYZ are the first formal color spaces defined by the CIE 
in 1931. Since display devices can only generate non-negative primaries, and an 
adequate amount of luminance is required, there is, in practice, a limitation on the
gamut of colors that can be reproduced on a given device. Color characteristics of a
device can be specified by its International Color Consortium (ICC) profile.

Figure 2.2 Color spaces: (a) additive color space and (b) subtractive color space.


Color Management 


Color management must be employed to generate the exact same color on different
devices, where the device-dependent color values of the input device, given its ICC pro-
file, are first mapped to a standard device-independent color space, sometimes called the
Profile Connection Space (PCS), such as CIEXYZ. They are then mapped to the device-
dependent color values of the output device given the ICC profile of the output device.
Hence, an ICC profile is essentially a mapping from a device color space to the PCS 
and from the PCS to a device color space. Suppose we have particular RGB and CMYK 
devices and want to convert the RGB values to CMYK. The first step is to obtain the 
ICC profiles of concerned devices. To perform the conversion, each (R, G, B) triplet is 
first converted to the PCS using the ICC profile of the RGB device. Then, the PCS is 
converted to the C, M, Y, and K values using the profile of the second device. 

Color management may be side-stepped by calibrating all devices to a common 
standard color space, such as sRGB, which was developed by HP and Microsoft 
in 1996. sRGB uses the color primaries defined by the ITU-R recommendation 
BT.709, which standardizes the format of high-definition television. When such a 
calibration is done well, no color translations are needed to get all devices to handle 
colors consistently. Avoiding the complexity of color management was one of the 
goals in developing sRGB [IEC 00]. 
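When a device is calibrated to sRGB, the mapping from device values to a device-independent space such as CIEXYZ is fixed and needs no per-device profile. The sketch below applies the commonly quoted sRGB transfer function and the four-decimal linear-RGB-to-XYZ matrix for the D65 white point. It is an illustration of a single fixed mapping only, not an implementation of ICC-based color management, and the function names are hypothetical.

```python
def srgb_to_linear(c8):
    """Undo the sRGB transfer function for one 8-bit channel value."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

# Linear sRGB (BT.709 primaries, D65 white) to CIEXYZ, rounded to four decimals.
RGB_TO_XYZ = [
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
]

def srgb_to_xyz(r8, g8, b8):
    """Map an 8-bit sRGB triplet to CIEXYZ with Y normalized to 1 for white."""
    linear = [srgb_to_linear(v) for v in (r8, g8, b8)]
    return tuple(sum(m * v for m, v in zip(row, linear)) for row in RGB_TO_XYZ)

white = srgb_to_xyz(255, 255, 255)
black = srgb_to_xyz(0, 0, 0)
```

The middle row of the matrix gives the luminance Y, so its coefficients show how much each primary contributes to perceived brightness.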


2.1.2 Contrast Sensitivity 


Contrast can be defined as the difference between the luminance of a region and its 
background. The human visual system is more sensitive to contrast than absolute
luminance; hence, we can perceive the world around us similarly regardless of changes
in illumination. Since most images are viewed by humans, it is important to under- 
stand how the human visual system senses contrast so that algorithms can be designed 
to preserve the more visible information and discard the less visible ones. Contrast- 
sensitivity mechanisms of human vision also determine which compression or pro- 
cessing artifacts we see and which we don't. The ability of the eye to discriminate 
between changes in intensity at a given intensity level is quantified by Weber's law. 


Weber’s Law 


Weber’s law states that smaller intensity differences are more visible on a darker back-
ground and can be quantified as

$$\frac{\Delta I}{I} = c \ \text{(constant)}, \quad \text{for } I > 0 \tag{2.3}$$

where $\Delta I$ is the just noticeable difference (JND) [Gon 07]. Eqn. (2.3) states that the
JND grows in proportion to the intensity level $I$. Note that $I = 0$ denotes the dark-
est intensity, while $I = 255$ is the brightest. The value of $c$ is empirically found to be
around 0.02. The experimental set-up to measure the JND is shown in Figure 2.3(a).
The rods and cones comply with Weber’s law above -2.6 log candelas (cd)/m2 (moon-
light) and 2 log cd/m2 (indoor) luminance levels, respectively [Fer 01].
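A direct consequence of (2.3) is that just-distinguishable intensity levels are spaced geometrically rather than uniformly. The sketch below evaluates the JND and builds such a ladder of levels. It assumes that $c = 0.02$ holds over the whole range from 1 to 255, which is a simplification; the function names are hypothetical.

```python
def jnd(intensity, c=0.02):
    """Just noticeable difference predicted by Weber's law: delta I = c * I."""
    return c * intensity

def jnd_ladder(i_min=1.0, i_max=255.0, c=0.02):
    """Intensity levels in which each level is one JND above the previous one."""
    levels = [i_min]
    while levels[-1] * (1.0 + c) <= i_max:
        levels.append(levels[-1] * (1.0 + c))
    return levels

levels = jnd_ladder()
dark_step = levels[1] - levels[0]      # step size near the dark end
bright_step = levels[-1] - levels[-2]  # step size near the bright end
```

The steps are much finer at the dark end of the ladder than at the bright end, which matches the statement that small differences are more visible on a darker background.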


Brightness Adaptation 


The human eye can adapt to different illumination/intensity levels [Fer 01]. It has
been observed that when the background-intensity level the observer has adapted to
is different from $I$, the observer's intensity resolution ability decreases. That is, when
$I_0$ is different from $I$ as shown in Figure 2.3(b), the JND $\Delta I$ increases relative to
the case $I_0 = I$. Furthermore, the simultaneous contrast effect illustrates that humans
perceive the brightness of a square with constant intensity differently as the intensity
of the background varies from light to dark [Gon 07].

It is also well-known that the human visual system undershoots and overshoots 
around the boundary of step transitions in intensity as demonstrated by the Mach 
band effect [Gon 07]. 


Visual Masking 


Visual masking refers to a nonlinear phenomenon experimentally observed in the 
human visual system when two or more visual stimuli that are closely coupled in 
space or time are presented to a viewer. The action of one visual stimulus on the 
visibility of another is called masking. The effect of masking may be a decrease in
brightness or failure to detect the target or some details, e.g., texture. Visual masking
can be studied under two cases: spatial masking and temporal masking.

Figure 2.3 Illustration of (a) the just noticeable difference and (b) brightness adaptation.


Spatial Masking 


Spatial masking is observed when a viewer is presented with a superposition of a 
target pattern and mask (background) image [Fer 01]. The effect states that the vis- 
ibility of the target pattern is lower when the background is spatially busy. Spa- 
tial busyness measures include local image variance or textureness. Spatial masking 
implies that visibility of noise or artifact patterns is lower in spatially busy areas of an 
image as compared to spatially uniform image areas. 


Temporal Masking 


Temporal masking is observed when two stimuli are presented sequentially [Bre 07]. 
Salient local changes in luminance, hue, shape, or size may become undetectable in the 
presence of large coherent object motion [Suc 11]. Considering video frames as a sequence 
of stimuli, fast-moving objects and scene cuts can trigger a temporal-masking effect. 


2.1.3 Spatio-Temporal Frequency Response 


An understanding of the response of the human visual system to spatial and tempo- 
ral frequencies is important to determine video-system design parameters and video- 
compression parameters, since frequencies that are invisible to the human eye are 
irrelevant.

Spatial-Frequency Response

Spatial frequencies are related to how still (static) image patterns vary in the horizontal
and vertical directions in the spatial plane. The spatial-frequency response of the human
eye varies with the viewing distance; i.e., the closer we get to the screen the better we can
see details. In order to specify the spatial frequency independent of the viewing distance,
spatial frequency (in cycles/distance) must be normalized by the viewing distance d,
which can be done by defining the viewing angle $\theta$ as shown in Figure 2.4(a).

Let w denote the picture width. If $w/2 \ll d$, then

$$\frac{\theta}{2} \approx \tan\frac{\theta}{2} = \frac{w/2}{d}$$

considering the right triangle formed by the viewer location, an end of the picture, and the
middle of the picture. Hence,

$$\theta \approx \frac{w}{d} \ \text{(radians)} = \frac{180\,w}{\pi d} \ \text{(degrees)} \qquad (2.4)$$

Let $f_w$ denote the number of cycles per picture width; then the normalized horizontal
spatial frequency (i.e., number of cycles per viewing degree) $f_\theta$ is given by

$$f_\theta = \frac{f_w}{\theta} = \frac{\pi d}{180\,w}\, f_w \ \text{(cycles/degree)} \qquad (2.5)$$

The normalized vertical spatial frequency can be defined similarly in the units of
cycles/degree. As we move away from the screen d increases, and the same number of
cycles per picture width $f_w$ appears as a larger frequency $f_\theta$ per viewing degree. Since
the human eye has reduced contrast sensitivity at higher frequencies, the same pattern
is more difficult to see from a larger distance d. The horizontal and vertical resolution
(number of pixels and lines) of a TV has been determined such that horizontal
and vertical sampling frequencies are twice the highest frequency we can see (according
to the Nyquist sampling theorem), assuming a fixed value for the ratio d/w, i.e.,
viewing distance over picture width. Given a fixed viewing distance, clearly we need
more video resolution (pixels and lines) as picture (screen) size increases to experience
the same video quality.
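
The normalization by viewing distance can be checked numerically. The following is a minimal sketch, assuming d is measured from the viewer to the center of the picture; the function names and the 1 m picture with 100 cycles per width are illustrative choices, not values from the text.

```python
import math

def viewing_angle_exact_deg(w, d):
    """Viewing angle in degrees, assuming d is measured to the picture center."""
    return math.degrees(2.0 * math.atan((w / 2.0) / d))

def viewing_angle_approx_deg(w, d):
    """Small-angle approximation: theta ~ 180 w / (pi d) degrees."""
    return 180.0 * w / (math.pi * d)

def cycles_per_degree(f_w, w, d):
    """Convert cycles per picture width to cycles per viewing degree."""
    return f_w / viewing_angle_approx_deg(w, d)

if __name__ == "__main__":
    # Hypothetical setup: a 1 m wide picture showing 100 cycles per width.
    for d in (3.0, 6.0):
        print(d, round(cycles_per_degree(100.0, 1.0, d), 2))
```

Doubling the viewing distance doubles the frequency per viewing degree, which is why the same pattern becomes harder to see from farther away.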

Figure 2.4(b) shows the spatial-frequency response, which varies by the average 
luminance level, of the eye for both the luminance and chrominance components 
of still images. We see that the spatial-frequency response of the eye, in general, has 
low-pass/band-pass characteristics, and our eyes are more sensitive to higher fre- 
quency patterns in the luminance components compared with those in the chromi- 
nance components. The latter observation is the basis of the conversion from RGB 
to the luminance-chrominance space for color image processing and the reason we 
subsample the two chrominance components in color image/video compression.

Figure 2.4 Spatial frequency and spatial response: (a) viewing angle
and (b) spatial-frequency response of the human eye [Mul 85].


Temporal-Frequency Response 


Video is displayed as a sequence of still frames. The frame rate is measured in terms of 
the number of pictures (frames) displayed per second or Hertz (Hz). The frame rates 
for cinema, television, and computer monitors have been determined according to 
the temporal-frequency response of our eyes. The human eye has lower sensitivity to 
higher temporal frequencies due to temporal integration of incoming light into the 
retina, which is also known as vision persistence. It is well known that the integration 
period is inversely proportional to the incoming light intensity. Therefore, we can see 
higher temporal frequencies on brighter screens. Psycho-visual experiments indicate 
the human eye cannot perceive flicker if the refresh rate of the display (temporal fre- 
quency) is more than 50 times per second for TV screens. Therefore, the frame rate 
for TV is set at 50-60 Hz, while the frame rate for brighter computer monitors is 72 
Hz or higher, since the brighter the screen the higher the critical flicker frequency. 


Interaction Between Spatial- and Temporal-Frequency Response 


Video exhibits both spatial and temporal variations, and spatial- and temporal- 
frequency responses of the eye are not mutually independent. Hence, we need to 
understand the spatio-temporal frequency response of the eye. The effects of chang- 
ing average luminance on the contrast sensitivity for different combinations of spatial 
and temporal frequencies have been investigated [Nes 67]. Psycho-visual experiments
indicate that when the temporal (spatial) frequencies are close to zero, the spatial
(temporal) frequency response has bandpass characteristics. At high temporal (spa- 
tial) frequencies, the spatial (temporal) frequency response has low-pass character- 
istics with smaller cut-off frequency as temporal (spatial) frequency increases. This 
implies that we can exchange spatial video resolution for temporal resolution, and 
vice versa. Hence, when a video has high motion (moves fast), the eyes cannot sense 
high spatial frequencies (details) well if we exclude the effect of eye movements. 


Eye Movements 


The human eye is similar to a sphere that is free to move like a ball in a socket. If 
we look at a nearby object, the two eyes turn in; if we look to the left, the right eye 
turns in and the left eye turns out; if we look up or down, both eyes turn up or down 
together. These movements are directed by the brain [Hub 88]. There are two main 
types of gaze-shifting eye movements, saccadic and smooth pursuit, that affect the 
spatial- and spatio-temporal frequency response of the eye. Saccades are rapid move- 
ments of the eyes while scanning a visual scene. “Saccadic eye movements” enable 
us to scan a greater area of the visual scene with the high-resolution fovea of the eye. 
On the other hand, “smooth pursuit” refers to movements of the eye while tracking 
a moving object, so that a moving image remains nearly static on the high-resolution 
fovea. Obviously, smooth pursuit eye movements affect the spatio-temporal fre- 
quency response of the eye. This effect can be modeled by tracking eye movements 
of the viewer and motion compensating the contrast sensitivity function accordingly. 


2.1.4 Stereo/Depth Perception 


Stereoscopy creates the illusion of 3D depth from two 2D images, a left and a right
image that we should view with our left and right eyes. The horizontal distance
between the eyes (called interpupillary distance) of an average human is 6.5 cm. The
difference between the left and right retinal images is called binocular disparity. Our
brain deduces depth information from this binocular disparity. 3D display technologies
that enable viewing of right and left images with our right and left eyes, respectively,
are discussed in Section 2.3.1.


Accommodation, Vergence, and Visual Discomfort


In human stereo vision, there are two oculomotor mechanisms, accommodation 
(where we focus) and vergence (where we look), which are reflex eye movements. 
Accommodation is the process by which the eye changes optical focus to maintain a 
clear image of an object as its distance from the eye varies. Vergence, or convergence,
is the movement of both eyes to make sure the image of the object being looked at
falls on the corresponding spot on both retinas. In real 3D vision, accommodation
and vergence distances are the same. However, in flat 3D displays both left and right 
images are displayed on the plane of the screen, which determines the accommoda- 
tion distance, while we look and perceive 3D objects at a different distance (usually 
closer to us), which is the vergence distance. This difference between accommoda- 
tion and vergence distances may cause serious discomfort if it is greater than some 
tolerable amount. The depth of an object in the scene is determined by the disparity 
value, which is the displacement of a feature point between the right and left views. 
The depth, hence the difference between accommodation and vergence distances, 
can be controlled by 3D-video (disparity) processing at the content preparation stage 
to provide a comfortable 3D viewing experience. 

Another cause of viewing discomfort is the cross-talk between the left and right 
views, which may cause ghosting and blurring. Cross-talk may result from imperfec- 
tions in polarizing filters (passive glasses) or synchronization errors (active shutters), 
but it is more prominent in auto-stereoscopic displays where the optics may not 
completely prevent cross-talk between the left and right views. 


Binocular Rivalry/Suppression Theory 


Binocular rivalry is a visual perception phenomenon that is observed when different
images are presented to right and left eyes [Wad 96]. When the quality difference
between the right and left views is small, according to the suppression theory of
stereo vision, the human eye can tolerate absence of high-frequency content in one
of the views; therefore, two views can be represented at unequal spatial resolutions
or quality. This effect has led to asymmetric stereo-video coding, where only the
dominant view is encoded with high fidelity (bitrate). The results have shown that
perceived 3D-video quality of such asymmetrically processed stereo pairs is similar to
that of symmetrically encoded sequences at higher total bitrate. It has also been observed
that scaling (zoom in/out) one or both views of a stereoscopic test sequence does not
affect depth perception. We note that these results have been confirmed on short test
sequences. It is not known whether asymmetric view resolution or quality would
cause viewing discomfort over longer videos with increased period of viewing.


2.2 Digital Video 


We have experienced a digital media revolution in the last couple of decades. TV 
and cinema have gone all-digital and high-definition, and most movies and some TV
broadcasts are now in 3D format. High-definition digital video has landed on lap-
tops, tablets, and cellular phones with high-quality media streaming over the Inter- 
net. Apart from the more robust form of the digital signal, the main advantage of 
digital representation and transmission is that they make it easier to provide a diverse 
range of services over the same network. Digital video brings broadcasting, cinema, 
computers, and communications industries together in a truly revolutionary man- 
ner, where telephone, cable TV, and Internet service providers have become fierce 
competitors. A single device can serve as a personal computer, a high-definition TV, 
and a videophone. We can now capture live video on a mobile device, apply digital 
processing on a laptop or tablet, and/or print still frames at a local printer. Other 
applications of digital video include medical imaging, surveillance for military and 
law enforcement, and intelligent highway systems. 


2.2.1 Spatial Resolution and Frame Rate 


Digital-video systems use component color representation. Digital color cameras 
provide individual RGB component outputs. Component color video avoids the 
artifacts that result from analog composite encoding. In digital video, there is no 
need for blanking or sync pulses, since it is clear where a new line starts given the 
number of pixels per line. 

The horizontal and vertical resolution of digital video is related to the pixel sampling
density, i.e., the number of pixels per unit distance. The number of pixels per
line and the number of lines per frame are used to classify video as standard, high, or
ultra-high definition, as depicted in Figure 2.5. In low-resolution digital video, a
pixellation (aliasing) artifact arises due to lack of sufficient spatial resolution. It manifests
itself as jagged edges resulting from individual pixels becoming visible. The visibility
of pixellation artifacts varies with the size of the display and the viewing distance.
This is quite different from analog video, where the lack of spatial resolution results
in blurring of the image in the respective direction.

The frame/field rate is typically 50/60 Hz, although some displays use frame inter- 
polation to display at 100/120, 200 or even 400 Hz. The notation 50i (or 60i) indi- 
cates interlaced video with 50 (60) fields/sec, which corresponds to 25 (30) pictures/ 
sec obtained by weaving the two fields together. On the other hand, 50p (60p) denotes 
50 (60) full progressive frames/sec. 

Figure 2.5 Digital-video spatial-resolution formats: Ultra HD (3840 x 2160), Full HD
(1920 x 1080), HD (1280 x 720), and SD (720 x 576, 720 x 488).

The arrangement of pixels and lines in a contiguous region of the memory is
called a bitmap. There are five key parameters of a bitmap: the starting address in
the memory, the number of pixels per line, the pitch value, the number of lines,
and the number of bits per pixel. The pitch value specifies the distance in memory
from the start of one line to the next. The most common use of pitch different from
the number of pixels per line is to set pitch to the next highest power of 2, which
may help certain applications run faster. Also, when dealing with interlaced inputs,
setting the pitch to double the number of pixels per line facilitates writing lines from
each field alternately in memory. This will form a “weaved frame” in a contiguous
region of the memory.
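
The role of the pitch can be sketched in a few lines. This is an illustrative sketch only: it assumes the pitch is measured in bytes and that each pixel occupies one byte, and the function names are hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of 2 that is >= n (a common choice of pitch)."""
    return 1 << (n - 1).bit_length()

def pixel_offset(start, pitch, bytes_per_pixel, x, y):
    """Memory offset of pixel (x, y) in a bitmap with the given pitch."""
    return start + y * pitch + x * bytes_per_pixel

def weave_fields(top_field, bottom_field, width):
    """Write two fields into one frame buffer using a pitch of 2 * width.

    Each field is a list of byte strings, one per line (1 byte per pixel).
    """
    lines = len(top_field)
    frame = bytearray(2 * lines * width)
    field_pitch = 2 * width  # stepping one field line skips one frame line
    for field_index, field in enumerate((top_field, bottom_field)):
        base = field_index * width  # bottom field starts one frame line down
        for y, line in enumerate(field):
            offset = base + y * field_pitch
            frame[offset:offset + width] = line
    return bytes(frame)

if __name__ == "__main__":
    print(next_power_of_two(720))
    print(pixel_offset(0, 1024, 1, 10, 2))
    print(weave_fields([b"AAAA", b"CCCC"], [b"BBBB", b"DDDD"], 4))
```

Writing each field with a doubled pitch leaves a gap of one line after every field line, which the other field fills, so the two fields end up interleaved in memory.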


2.2.2 Color, Dynamic Range, and Bit-Depth 


This section addresses color representation, dynamic range, and bit-depth in digital 
images/video. 


Color Capture and Display 


Color cameras can be the three-sensor type or single-sensor type. Three-sensor cam- 
eras capture R, G, and B components using different CCD panels, using an opti- 
cal beam splitter; however, they may suffer from synchronicity problems and high 
cost, while single-sensor cameras often have to compromise spatial resolution. This is 
because a color filter array is used so that each CCD element captures one of R, G, or 
B pixels in some periodic pattern. A commonly used color filter pattern is the Bayer 
array, shown in Figure 2.6, where two out of every four pixels are green, one is red, 
and one is blue, since green signal contributes the most to the luminance channel. 
The missing pixel values in each color channel are computed by linear or adaptive
interpolation filters, which may result in some aliasing artifacts. Similar color filter
array patterns are also employed in LCD/LED displays, where the human eye per- 
forms low-pass filtering to perceive a full-colored image. 
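
As a rough illustration of color-filter-array sampling and linear interpolation, the sketch below assumes an R G / G B tile layout (the actual layout in Figure 2.6 may differ) and estimates only the green channel at interior pixels by averaging the four nearest green samples. All function names are hypothetical.

```python
def bayer_color(row, col):
    """Color sampled at (row, col) for an assumed R G / G B tile."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"

def mosaic(rgb):
    """Keep one color sample per pixel; rgb is a list of rows of (R, G, B)."""
    index = {"R": 0, "G": 1, "B": 2}
    return [[px[index[bayer_color(r, c)]] for c, px in enumerate(row)]
            for r, row in enumerate(rgb)]

def green_at(raw, row, col):
    """Bilinear green estimate at an interior pixel of the raw mosaic."""
    if bayer_color(row, col) == "G":
        return float(raw[row][col])
    # At red and blue sites the four direct neighbors are all green samples.
    neighbors = (raw[row - 1][col], raw[row + 1][col],
                 raw[row][col - 1], raw[row][col + 1])
    return sum(neighbors) / 4.0

if __name__ == "__main__":
    flat = [[(10, 20, 30)] * 4 for _ in range(4)]
    raw = mosaic(flat)
    print(raw[0])
    print(green_at(raw, 2, 2), green_at(raw, 1, 1))
```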


Dynamic Range 

The dynamic range of a capture device (e.g., a camera or scanner) or a display device
is the ratio between the maximum and minimum light intensities that can be represented.
The luminance levels in the environment range from about $10^{-4}$ cd/m² (starlight)
to $10^{6}$ cd/m² (sunlight); i.e., the dynamic range is about 10 log units [Fer 01].
The human eye has complex fast and slow adaptation schemes to cope with this
large dynamic range. However, a typical imaging device (camera or display) has a
maximum dynamic range of 300:1, which corresponds to 2.5 log units. Hence, our
ability to capture and display a foreground object subject to strong backlighting with
proper contrast is limited. High dynamic range (HDR) imaging aims to remedy this
problem.


HDR Image Capture 


Figure 2.6 Bayer color-filter array pattern.

HDR image capture with a standard dynamic range camera requires taking a
sequence of pictures at different exposure levels, where raw pixel exposure data (linear
in exposure time) are combined by weighted averaging to obtain a single HDR
image [Gra 10]. There are two possible ways to display HDR images: i) employ
new higher dynamic range display technologies, or ii) employ local tone-mapping
algorithms for dynamic range compression (see Chapter 3) to better render details in
bright or dark areas on a standard display [Rei 07].
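
The weighted averaging of multiple exposures can be sketched for a single pixel as follows. This is not the specific method of [Gra 10]; it is a minimal illustration that assumes raw values are linear in exposure time, clip at 255, and are weighted by a triangular ("hat") function that distrusts values near the ends of the range. The function names and the weighting function are assumptions.

```python
def hat_weight(z, z_max=255):
    """Triangular weight: highest at mid-range, zero at 0 and z_max."""
    mid = z_max / 2.0
    return 1.0 - abs(z - mid) / mid

def merge_hdr_pixel(samples, exposure_times, z_max=255):
    """Weighted average of exposure-normalized raw values for one pixel."""
    num = 0.0
    den = 0.0
    for z, t in zip(samples, exposure_times):
        w = hat_weight(z, z_max)
        num += w * (z / t)  # z / t estimates scene radiance
        den += w
    if den == 0.0:
        # Every sample is fully under- or over-exposed: fall back to the
        # shortest exposure, which is the least likely to be clipped.
        z, t = min(zip(samples, exposure_times), key=lambda pair: pair[1])
        return z / t
    return num / den

if __name__ == "__main__":
    # Hypothetical pixel with true radiance 1000 units/s; the 1 s exposure clips.
    print(merge_hdr_pixel([10, 100, 255], [0.01, 0.1, 1.0]))
```

The clipped sample receives zero weight, so the estimate is driven by the exposures in which the pixel is well within range.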


HDR Displays 

Recently, new display technologies that are capable of up to 50,000:1 or 4.7 log units
dynamic range with maximum intensity 8500 cd/m², compared to standard displays
with contrast ratio 2 log units and maximum intensity 300 cd/m², have been proposed
[See 04]. This high dynamic range matches the human eye’s short time-scale
(fast) adaptation capability well, which enables our eyes to capture approximately 5
log units of dynamic range at the same time.
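
The log-unit figures quoted in this section follow directly from base-10 logarithms of the contrast ratios. A small check, with a hypothetical helper name:

```python
import math

def log_units(contrast_ratio):
    """Dynamic range in log (base-10) units for a max:min intensity ratio."""
    return math.log10(contrast_ratio)

if __name__ == "__main__":
    print(round(log_units(300), 1))     # typical imaging device, 300:1
    print(round(log_units(50000), 1))   # HDR display, 50,000:1
    print(6 - (-4))                     # starlight to sunlight, in log units
```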


Bit-Depth 


Image-intensity values at each sample are quantized for a finite-precision representation.
Today, each color component signal is typically represented with 8 bits per
pixel, which can capture 255:1 dynamic range for a total of 24 bits/pixel and $2^{24}$
distinct colors to avoid “contouring artifacts.” Contouring arises in slowly varying
regions of image intensity due to insufficient bit resolution. Some applications, such
as medical imaging and post-production editing of motion pictures, may require 10,
12, or more bits/pixel/color. In high dynamic range imaging, 16 bits/pixel/color is
required to capture a 50,000:1 dynamic range, which is now supported in JPEG.
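
The relation between bit-depth and dynamic range can be illustrated with a short sketch. It assumes linear quantization, so that the representable ratio is the largest code divided by the smallest nonzero code; the helper name is hypothetical.

```python
def bits_needed(contrast_ratio):
    """Smallest n such that codes 1 .. 2**n - 1 span the given ratio."""
    return int(contrast_ratio).bit_length()

if __name__ == "__main__":
    print(bits_needed(255))     # 8 bits cover 255:1
    print(bits_needed(50000))   # 16 bits cover 50,000:1
    print(2 ** 24)              # distinct colors with 3 x 8 bits
```

Under this assumption 15 bits reach only 32,767:1, which is why 16 bits per color are needed for a 50,000:1 range.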

Digital video requires much higher data rates and transmission bandwidths as
compared to digital audio. CD-quality digital audio is represented with 16 bits/
sample, and the required sampling rate is 44.1 kHz. Thus, the resulting data rate is
approximately 700 kbits/sec (kbps). This is multiplied by 2 for stereo audio. In comparison,
a high-definition TV signal has 1920 pixels/line and 1080 lines for each
luminance frame, and 960 pixels/line and 540 lines for each chrominance frame.
With 25 frames/sec and 8 bits/pixel/color, the resulting data rate is about 622 Mbps
(about 746 Mbps at 30 frames/sec), which testifies to the statement that a picture is
worth 1000 words! Thus, the feasibility of digital video is dependent on
image-compression technology.
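
The audio and video data rates compared above can be reproduced with simple arithmetic. The sketch below uses the 44.1 kHz CD sampling rate and the frame dimensions given in the text; the function names are illustrative.

```python
def audio_rate_bps(bits_per_sample, sample_rate_hz, channels=1):
    """Raw audio data rate in bits per second."""
    return bits_per_sample * sample_rate_hz * channels

def video_rate_bps(luma_w, luma_h, chroma_w, chroma_h, fps, bits_per_sample):
    """Raw video data rate for one luminance and two chrominance frames."""
    samples_per_frame = luma_w * luma_h + 2 * chroma_w * chroma_h
    return samples_per_frame * bits_per_sample * fps

if __name__ == "__main__":
    print(audio_rate_bps(16, 44100))                      # mono CD audio
    print(video_rate_bps(1920, 1080, 960, 540, 25, 8))    # HD at 25 frames/sec
    print(video_rate_bps(1920, 1080, 960, 540, 30, 8))    # HD at 30 frames/sec
```

The video rate is roughly three orders of magnitude above the audio rate, in line with the comparison made in the text.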


2.2.3 Color Image Processing 


Color images/video are captured and displayed in the RGB format. However, they 
are often converted to an intermediate representation for efficient compression and 
processing. We review the luminance-chrominance (for compression and filtering) 
and the normalized RGB and hue-saturation-intensity (HSI) (for color-specific
processing) representations in the following.


Luminance-Chrominance


The luminance-chrominance color model was used to develop an analog color TV
transmission system that is backwards compatible with the legacy analog black and
white TV systems. The luminance component, denoted by Y, corresponds to the
gray-level representation of video, while the two chrominance components, denoted
by U and V for analog video or Cr and Cb for digital video, represent the deviation
of color from the gray level on blue-yellow and red-cyan axes. It has been observed
that the human visual system is less sensitive to variations (higher frequencies) in
chrominance components (see Figure 2.4(b)). This has resulted in the subsampled
chrominance formats, such as 4:2:2 and 4:2:0. In the 4:2:2 format, the chrominance
components are subsampled only in the horizontal direction, while in 4:2:0
they are subsampled in both directions as illustrated in Figure 2.7. The
luminance-chrominance representation offers higher compression efficiency compared
to the RGB representation due to this subsampling.
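
The saving from chrominance subsampling can be made concrete with a small sketch. The block averaging used here is an assumption for illustration; actual standards specify particular filters and sample positions. The names are hypothetical.

```python
# Horizontal and vertical chrominance subsampling factors per format.
CHROMA_FACTORS = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}

def average_bits_per_pixel(fmt, bit_depth=8):
    """Average bits per pixel: one luminance plus two subsampled chrominance."""
    h, v = CHROMA_FACTORS[fmt]
    return bit_depth * (1 + 2 / (h * v))

def subsample(plane, fmt):
    """Subsample a chrominance plane (list of rows) by block averaging."""
    h, v = CHROMA_FACTORS[fmt]
    rows, cols = len(plane), len(plane[0])
    out = []
    for r in range(0, rows, v):
        out_row = []
        for c in range(0, cols, h):
            block = [plane[rr][cc]
                     for rr in range(r, min(r + v, rows))
                     for cc in range(c, min(c + h, cols))]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out

if __name__ == "__main__":
    for fmt in CHROMA_FACTORS:
        print(fmt, average_bits_per_pixel(fmt))
    print(subsample([[0, 2, 4, 6], [2, 4, 6, 8]], "4:2:0"))
```

With 8-bit samples, 4:2:2 needs two thirds and 4:2:0 one half of the bits of the unsubsampled representation.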

ITU-R BT.601 defines the conversion between RGB and YCrCb representations as:

$$\begin{aligned}
Y  &= 0.299\,R + 0.587\,G + 0.114\,B \\
Cr &= 0.499\,R - 0.418\,G - 0.0813\,B + 128 \qquad (2.6) \\
Cb &= -0.169\,R - 0.331\,G + 0.499\,B + 128
\end{aligned}$$

which states that the human visual system perceives the contribution of R-G-B to
image intensity approximately with a 3-6-1 ratio, i.e., red is weighted by 0.3, green
by 0.6 and blue by 0.1.

Figure 2.7 Chrominance subsampling formats: (a) no subsampling; (b) 4:2:2; (c) 4:2:0 format.

The inverse conversion is given by

$$\begin{aligned}
R &= Y + 1.402\,(Cr - 128) \\
G &= Y - 0.714\,(Cr - 128) - 0.344\,(Cb - 128) \qquad (2.7) \\
B &= Y + 1.772\,(Cb - 128)
\end{aligned}$$

The resulting R, G, and B values must be clipped to the range [0, 255] if they fall
outside. We note that Y-Cr-Cb is not a color space. It is a way of encoding the RGB
information, and the actual colors displayed depend on the specific RGB space used.
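
The forward and inverse conversions of Eqns. (2.6) and (2.7) can be sketched as a round trip. The coefficients are those given in the equations; the function names and the choice to round before clipping are assumptions.

```python
def rgb_to_ycrcb(r, g, b):
    """Forward conversion, Eqn. (2.6); inputs in the range 0..255."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = 0.499 * r - 0.418 * g - 0.0813 * b + 128
    cb = -0.169 * r - 0.331 * g + 0.499 * b + 128
    return y, cr, cb

def clip8(value):
    """Round and clip a value to the 8-bit range [0, 255]."""
    return max(0, min(255, int(round(value))))

def ycrcb_to_rgb(y, cr, cb):
    """Inverse conversion, Eqn. (2.7), followed by clipping."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.714 * (cr - 128) - 0.344 * (cb - 128)
    b = y + 1.772 * (cb - 128)
    return clip8(r), clip8(g), clip8(b)

if __name__ == "__main__":
    ycc = rgb_to_ycrcb(230, 180, 50)
    print([round(v, 2) for v in ycc])
    print(ycrcb_to_rgb(*ycc))
```

Because the published coefficients are rounded, the round trip is only approximately exact before rounding to integers.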

A common practice in color image processing, such as edge detection, enhance- 
ment, denoising, restoration, etc., in the luminance-chrominance domain is to pro- 
cess only the luminance (Y) component of the image. There are two main reasons 
for this: i) processing R, G, and B components independently may alter the color 
balance of the image, and ii) the human visual system is not very sensitive to high fre- 
quencies in the chrominance components. Therefore, we first convert a color image 
into the Y-Cr-Cb representation, then perform image enhancement, denoising, restoration,
etc., on the Y channel only. We then transform the processed Y channel and unpro- 
cessed Cr and Cb channels back to the R-G-B domain for display. 


Normalized rgb 


Normalized rgb components aim to reduce the dependency of color represented by 
the RGB values on image brightness. They are defined by 


$$\begin{aligned}
r &= R/(R+G+B) \\
g &= G/(R+G+B) \qquad (2.8) \\
b &= B/(R+G+B)
\end{aligned}$$

The normalized r, g, b values are always within the range 0 to 1, and

$$r + g + b = 1 \qquad (2.9)$$


Hence, they can be specified by any two components, typically by (r, g) and the third 
component can be obtained from Eqn. (2.9). The normalized rgb domain is often 
used in color-based object detection, such as skin-color or face detection. 


Example. We demonstrate how the normalized rgb domain helps to
detect similar colors independent of brightness by means of an example:
Let’s assume we have two pixels with (R, G, B) values (230, 180, 50) and
(115, 90, 25). It is clear that the second pixel is half as bright as the first,
which may be because it is in a shadow. In the normalized rgb, both pixels
are represented by r = 0.50, g = 0.39, and b = 0.11. Hence, it is apparent
that they represent the same color after correcting for brightness difference
by the normalization.
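
The example can be reproduced with a few lines of code. The function name is hypothetical, and returning equal thirds for a black pixel is an assumption made only to avoid division by zero.

```python
def normalize_rgb(R, G, B):
    """Normalized rgb of Eqn. (2.8)."""
    total = R + G + B
    if total == 0:
        # Black pixel: chromaticity is undefined; equal thirds is an assumption.
        return (1 / 3, 1 / 3, 1 / 3)
    return (R / total, G / total, B / total)

if __name__ == "__main__":
    for pixel in ((230, 180, 50), (115, 90, 25)):
        print(pixel, [round(v, 2) for v in normalize_rgb(*pixel)])
```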


Hue-Saturation-Intensity (HSI) 


Color features that best correlate with human perception of color are hue, satura- 
tion, and intensity. Hue relates to the dominant wavelength, saturation relates to the 
spread of power about this wavelength (purity of the color), and intensity relates to 
the perceived luminance (similar to the Y channel). There is a family of color spaces 
that specify colors in terms of hue, saturation, and intensity, known as HSI spaces. 
Conversion to HSI where each component is in the range [0,1] can be performed 
from the scaled RGB, where each component is divided by 255 so they are in the 
range [0,1]. The HSI space specifies color in cylindrical coordinates and conversion 
formulas (2.10) are nonlinear [Gon 07]. 


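$$H = \begin{cases} \theta & \text{if } B \le G \\ 360^\circ - \theta & \text{if } B > G \end{cases}
\quad \text{where} \quad
\theta = \arccos\left\{ \frac{\tfrac{1}{2}\,[(R-G) + (R-B)]}{\sqrt{(R-G)^2 + (R-B)(G-B)}} \right\}$$

$$S = 1 - \frac{3\,\min\{R, G, B\}}{R+G+B} \qquad (2.10)$$

$$I = \frac{R+G+B}{3}$$

Here H is obtained in degrees and is divided by 360 to bring it into the range [0,1].

Note that HSI is not a perceptually uniform color space, i.e., equal perturbations
in the component values do not result in perceptually equal color variations across
the range of component values. The CIE has also standardized some perceptually
uniform color spaces, such as L*u*v* (CIELUV) and L*a*b* (CIELAB).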
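
A sketch of the RGB-to-HSI conversion in (2.10) follows, using the standard form of these formulas with 8-bit inputs scaled to [0,1] and hue divided by 360 degrees. Setting the hue to 0 for gray pixels, where it is undefined, is a convention chosen here for illustration, and the function name is hypothetical.

```python
import math

def rgb_to_hsi(R, G, B):
    """Convert 8-bit RGB to (H, S, I), each in the range [0, 1]."""
    r, g, b = R / 255.0, G / 255.0, B / 255.0
    total = r + g + b
    intensity = total / 3.0
    if total == 0.0:
        return 0.0, 0.0, 0.0  # black: hue and saturation set to 0 by convention
    saturation = 1.0 - 3.0 * min(r, g, b) / total
    numerator = 0.5 * ((r - g) + (r - b))
    denominator = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if denominator == 0.0:
        hue_deg = 0.0  # gray: hue is undefined, 0 by convention here
    else:
        ratio = max(-1.0, min(1.0, numerator / denominator))
        theta = math.degrees(math.acos(ratio))
        hue_deg = theta if b <= g else 360.0 - theta
    return hue_deg / 360.0, saturation, intensity

if __name__ == "__main__":
    for name, rgb in (("red", (255, 0, 0)), ("green", (0, 255, 0)),
                      ("blue", (0, 0, 255)), ("gray", (128, 128, 128))):
        print(name, [round(v, 3) for v in rgb_to_hsi(*rgb)])
```

The primaries land at hues of 0, 1/3, and 2/3, i.e., 0, 120, and 240 degrees around the cylinder.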


2.2.4 Digital-Video Standards 


Exchange of digital video between different products, devices, and applications requires 
digital-video standards. We can group digital-video standards as video-format (resolu- 
tion) standards, video-interface standards, and image/video compression standards. In 
the early days of analog TV, cinema (film), and cameras (cassette), the computer, TV, 
and consumer electronics industries established different display resolutions and scan- 
ning standards. Because digital video has brought cinema, TV, consumer electronics, 
and computer industries ever closer, standardization across industries has started. This 
section introduces recent standards and standardization efforts.


Video-Format Standards


Historically, standardization of digital-video formats originated from different 
sources: ITU-R driven by the TV industry, SMPTE driven by the motion picture 
industry, and computer/consumer electronics associations. 

Digital video was in use in broadcast TV studios even in the days of analog TV, 
where editing and special effects were performed on digitized video because it is easier 
to manipulate digital images. Working with digital video avoids artifacts that would 
otherwise be caused by repeated analog recording of video on tapes during various 
production stages. Digitization of analog video has also been needed for conversion 
between different analog standards, such as from PAL to NTSC, and vice versa. 
ITU-R (formerly CCIR) Recommendation BT.601 defines a standard definition TV
(SDTV) digital-video format for 525-line and 625-line TV systems, also known
as the digital studio standard, which was originally intended to digitize analog TV
signals to permit digital post-processing as well as international exchange of programs.
This recommendation is based on component video with one luminance (Y) and
two chrominance (Cr and Cb) signals. The sampling frequency for analog-to-digital
(A/D) conversion is selected to be an integer multiple of the horizontal sweep
frequencies (line rates) $f_{h,525} = 525 \times 29.97 = 15{,}734$ Hz and
$f_{h,625} = 625 \times 25 = 15{,}625$ Hz in both 525- and 625-line systems. Thus, for the luminance

$$f_{s,\mathrm{lum}} = 858\, f_{h,525} = 864\, f_{h,625} = 13.5 \ \text{MHz}$$

i.e., 525 and 625 line systems have 858 and 864 samples/line, respectively, and for
chrominance

$$f_{s,\mathrm{chr}} = f_{s,\mathrm{lum}} / 2 = 6.75 \ \text{MHz}$$

ITU-R BT.601 standards for both 525- and 625-line SDTV systems employ
interlaced scan, where the raw data rate is 165.9 Mbps. The parameters of both formats
are shown in Table 2.1. Historically, interlaced SDTV was displayed on analog
cathode ray tube (CRT) monitors, which employ interlaced scanning at 50/60 Hz.
Today, flat-panel displays and projectors can display video at 100/120 Hz in interlaced
or progressive mode, which requires scan-rate conversion and de-interlacing of the
50i/60i ITU-R BT.601 [ITU 11] broadcast signals.
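
The common sampling frequency and the raw data rate can be verified with exact arithmetic. The sketch assumes that the nominal 29.97 Hz frame rate is exactly 30000/1001 Hz and that the data rate counts 720 x 576 active pixels at 25 frames/sec with 8 bits of luminance plus an average of 8 bits of chrominance per pixel; the variable names are illustrative.

```python
from fractions import Fraction

# Line rates: lines per frame times frames per second.
line_rate_525 = 525 * Fraction(30000, 1001)   # about 15,734 Hz
line_rate_625 = 625 * 25                      # 15,625 Hz

# Luminance sampling frequency: samples per line times line rate.
luma_rate_525 = 858 * line_rate_525
luma_rate_625 = 864 * line_rate_625

# Raw data rate for the active picture area (assumed 16 bits/pixel on average).
raw_rate_bps = 720 * 576 * 25 * (8 + 8)

if __name__ == "__main__":
    print(float(line_rate_525), line_rate_625)
    print(float(luma_rate_525), luma_rate_625)
    print(raw_rate_bps / 1e6)
```

Both systems arrive at the same luminance sampling frequency, which is what allows a single digital studio format to serve 525- and 625-line signals.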

Recognizing that the resolution of SDTV is well behind today’s technology, a new
high-definition TV (HDTV) standard, ITU-R BT.709-5 [ITU 02], which doubles
the resolution of SDTV in both horizontal and vertical directions, has been approved
with three picture formats: 720p, 1080i, and 1080p. Table 2.1 shows their parameters.
Today broadcasters use either 720p/50/60 (called HD) or 1080i/25/29.97
(called FullHD). There are no broadcasts in 1080p format at this time. Note that
many 1080i/25 broadcasts use horizontal sub-sampling to 1440 pixels/line to save
bitrate. The 720p/50 format has full temporal resolution of 50 progressive frames per
second (with 720 lines). Note that most international HDTV events are captured in
either 1080i/25 or 1080i/29.97 (for 60 Hz countries), and presenting 1080i/29.97
in 50 Hz countries or vice versa requires scan-rate conversion. For 1080i/25 content,
720p/50 broadcasters will need to de-interlace the signal before transmission, and
for 1080i/29.97 content, both de-interlacing and frame-rate conversion are required.
Furthermore, newer 1920 × 1080 progressive scan consumer displays require up-scaling
of 1280 × 720 pixel HD broadcasts and 1440 × 1080i/25 sub-sampled FullHD
broadcasts.

In the computer and consumer electronics industry, standards for video-display 
resolutions are set by consortia of organizations such as the Video Electronics Standards
Association (VESA) and Consumer Electronics Association (CEA). The display stan- 
dards can be grouped as Video Graphics Array (VGA) and its variants and Extended 
Graphics Array (XGA) and its variants. The favorite aspect ratio of the display indus- 
try has shifted from the earlier 4:3 to 16:10 and 16:9. Some of these standards are 
shown in Table 2.2. The refresh rate was an important parameter for CRT monitors. 
Since activated LCD pixels do not flash on/off between frames, LCD monitors do 
not exhibit refresh-induced flicker. The only part of an LCD monitor that can pro- 
duce CRT-like flicker is its backlight, which typically operates at 200 Hz. 

Recently, standardization across TV, consumer electronics, and computer industries
has started, resulting in the so-called convergence enabled by digital video. For
example, some laptops and cellular phones now feature 1920 × 1080 progressive
mode, which is a format jointly supported by TV, consumer electronics, and computer
industries.


Table 2.1 ITU-R TV Broadcast Standards

Standard          Pixels   Lines   Interlace/Progressive, Picture Rate     Aspect Ratio
BT.601-7 480i     720      486     2:1 Interlace, 30 Hz (60 fields/s)      4:3, 16:9
BT.601-7 576i     720      576     2:1 Interlace, 25 Hz (50 fields/s)      4:3, 16:9
BT.709-5 720p     1280     720     Progressive, 50 Hz, 60 Hz               16:9
BT.709-5 1080i    1920     1080    2:1 Interlace, 25 Hz, 30 Hz             16:9
BT.709-5 1080p    1920     1080    Progressive                             16:9
BT.2020 2160p     3840     2160    Progressive                             16:9
BT.2020 4320p     7680     4320    Progressive                             16:9

Ultra-high definition television (UHDTV) is the most recent standard proposed
by NHK Japan and approved as ITU-R BT.2020 [ITU 12]. It supports the 4K
(2160p) and 8K (4320p) digital-video formats shown in Table 2.1. The Consumer
Electronics Association announced that “ultra high-definition” or “ultra HD” or
“UHD” would be used for displays that have an aspect ratio of at least 16:9 and at
least one digital input capable of carrying and presenting native video at a minimum
resolution of 3,840 × 2,160 pixels. The ultra-HD format is very similar to the 4K digital
cinema format (see Section 2.4.2) and may become a cross-industry standard in
the near future.


Video-Interface Standards 


Digital-video interface standards enable exchange of uncompressed video between 
various consumer electronics devices, including digital TV monitors, computer 
monitors, Blu-ray devices, and video projectors over cable. Two such standards are
Digital Visual Interface (DVI) and High-Definition Multimedia Interface (HDMI). 
HDMI is the most popular interface that enables transfer of video and audio on 
a single cable. It is backward compatible with DVI-D or DVI-I. HDMI 1.4 and 
higher support 2160p digital cinema and 3D stereo transfer. 


Image- and Video-Compression Standards 


Various digital-video applications, e.g., SDTV, HDTV, 3DTV, video on demand,
interactive games, and videoconferencing, reach potential users over either broadcast
channels or the Internet. Digital cinema content must be transmitted to movie theatres
over satellite links or must be shipped on hard disks. Raw (uncompressed) data
rates for digital video are prohibitive, since uncompressed broadcast HDTV requires
over 700 Mbits/s and 2K digital cinema data exceeds 5 Gbits/sec in uncompressed
form. Hence, digital video must be stored and transmitted in compressed form,
which leads to compression standards.


Table 2.2 Display Standards

Standard    Pixels   Lines   Aspect Ratio
VGA         640      480     4:3
WSVGA       1024     576     16:9
XGA         1024     768     4:3
WXGA        1366     768     16:9
SXGA        1280     1024    5:4
UXGA        1600     1200    4:3
FHD         1920     1080    16:9
WUXGA       1920     1200    16:10
HXGA        4096     3072    4:3
WQUXGA      3840     2400    16:10
WHUXGA      7680     4800    16:10

Video compression is a key enabling technology for digital video. Standardization 
of image and video compression is required to ensure compatibility of digital-video 
products and hardware by different vendors. As a result, several video-compression 
standards have been developed, and work for even more efficient compression is 
ongoing. Major standards for image and video compression are listed in Table 2.3. 

Historically, standardization in digital-image communication started with the 
ITU-T (formerly CCITT) digital fax standards. The ITU-T Recommendation T.4 
using 1D coding for digital fax transmission was ratified in 1980. Later, a more 
efficient 2D compression technique was added as an option to the ITU-T recom- 
mendation T.30 and ISO JBIG was developed to fix some of the problems with the 
ITU-T Group 3 and 4 codes, mainly in the transmission of half-tone images. 

JPEG was the first color still-image compression standard. It has also found some 
use in frame-by-frame video compression, called motion JPEG, mostly because of 
its wide availability in hardware. Later JPEG2000 was developed as a more efficient 
alternative especially at low bit rates. However, it has mainly found use in the digital 
cinema standards. 

The first commercially successful video-compression standard was MPEG-1 for
video storage on CD, which is now obsolete. MPEG-2 was developed for compression
of SDTV and HDTV as well as video storage on DVD and was the enabling
technology of digital TV. MPEG-4 AVC and HEVC were later developed as more
efficient compression standards, especially for HDTV and UHDTV as well as video
on blu-ray discs. We discuss image- and video-compression technologies and
standards in detail in Chapter 7 and Chapter 8, respectively.


Table 2.3 International Standards for Image/Video Compression

Standard                          Application
ITU-T (formerly CCITT) G3/G4      FAX, binary images
ISO JBIG                          Binary/halftone, gray-scale images
ISO JPEG                          Still images
ISO JPEG2000                      Digital cinema
ISO MPEG2                         Digital video, SDTV, HDTV
ISO MPEG4 AVC/ITU-T H.264         Digital video
ISO HEVC/ITU-T H.265              HD video, HDTV, UHDTV


2.3 3D Video 


3D cinema has gained wide acceptance in theatres as many movies are now produced 
in 3D. Flat-panel 3DTV has also been positively received by consumers for watching 
sports broadcasts and blu-ray movies. Current 3D-video displays are stereoscopic 
and are viewed by special glasses. Stereo-video formats can be classified as frame- 
compatible (mainly for broadcast TV) and full-resolution (sequential) formats. 
Alternatively, multi-view and super multi-view 3D-video displays are currently being 
developed for autostereoscopic viewing. Multi-view video formats without accompa- 
nying depth information require extremely high data rates. Multi-view-plus-depth 
representation and compression are often preferred for efficient storage and trans- 
mission of multi-view video as the number of views increases. There are also volu- 
metric, holoscopic (integral imaging), and holographic 3D-video formats, which are 
mostly considered as futuristic at this time. 

The main technical obstacles for 3DTV and video to achieve much wider accep- 
tance at home are: i) developing affordable, free-viewing natural 3D display tech- 
nologies with high spatial, angular, and depth resolution, and ii) capturing and 
producing 3D content in a format that is suitable for these display technologies. We 
discuss 3D display technologies and 3D-video formats in more detail below. 


2.3.1 3D-Display Technologies 


A 3D display should ideally reproduce a light field that is an indistinguishable copy 
of the actual 3D scene. However, this is a rather difficult task to achieve with today’s 
technology due to very large amounts of data that needs to be captured, processed, 
and stored/transmitted. Hence, current 3D displays can only reproduce a limited set 
of 3D visual cues instead of the entire light field; namely, they reproduce: 


• Binocular depth: Binocular disparity in a stereo pair provides a relative depth
  cue. 3D displays that present only two views, such as stereo TV and digital
  cinema, can only provide the binocular depth cue.

• Head-motion parallax: Viewers expect to see a scene or objects from a slightly
  different perspective when they move their head. Multi-view, light-field, or
  volumetric displays can provide head-motion parallax, although most displays can
  provide only limited parallax, such as only horizontal parallax.


We can broadly classify 3D display technologies as multiple-image (stereoscopic
and auto-stereoscopic), light-field, and volumetric displays, as summarized in
Figure 2.8. Multiple-image displays present two or more images of a scene by some 
multiplexing of color sub-pixels on a planar screen such that the right and left eyes 
see two separate images with binocular disparity, and rely upon the brain to fuse the 
two images to create the sensation of 3D. Light-field displays present light rays as if 
they are originating from a real 3D object/scene using various technologies such that 
each pixel of the display can emit multiple light rays with different color, intensity, 
and directions, as opposed to multiplexing pixels among different views. Volumetric 
displays aim to reconstruct a visual representation of an object/scene using voxels 
with three physical dimensions via emission, scattering, or relaying of light from a 
well-defined region in the physical (x_1, x_2, x_3) space, as opposed to displaying light
rays emitted from a planar screen.


Multiple-Image Displays 
Multiple-image displays can be classified as those that require glasses (stereoscopic) 
and those that don’t (auto-stereoscopic). 

Figure 2.8 Classification of 3D-display technologies.

Stereoscopic displays present two views with binocular disparity, one for the left
and one for the right eye, from a single viewpoint. Glasses are required to ensure that
only the right eye sees the right view and the left eye sees the left view. The glasses
can be passive or active. Passive glasses are used for color (wavelength) or polarization
multiplexing of the two views. Anaglyph is the oldest form of 3D display by color
multiplexing using red and cyan filters. Polarization multiplexing applies horizontal
and vertical (linear), or clockwise and counterclockwise (circular) polarization to the
left and right views, respectively. Glasses apply matching polarization to the right
and left eyes. The display shows both left and right views laid over each other with
polarization matching that of the glasses in every frame. This will lead to some loss of
spatial resolution since half of the sub-pixels in the display panel will be allocated to
each of the left and right views using polarized filters. Active glasses (also called
active shutter) present the left image to only the left eye by blocking the view of the 
right eye while the left image is being displayed and vice versa. The display alternates 
full-resolution left and right images in sequential order. The active 3D system must 
assure proper synchronism between the display and glasses. 3D viewing with passive 
or active glasses is the most developed and commercially available form of 3D display 
technology. We note that two-view displays lack head-motion parallax and can only 
provide 3D viewing from a single point of view (from the point where the right and 
left views have actually been captured) no matter from which angle the viewer looks 
at the screen. Furthermore, polarization may cause loss of some light due to polariza- 
tion filter absorption, which may affect scene brightness. 

Auto-stereoscopic displays do not require glasses. They can display two views or 
multiple views. Separation of views can be achieved by different optics technologies, 
such as parallax barriers or lenticular sheets, so that only certain rays are emitted in 
certain directions. They can provide head-motion parallax, in addition to binocular 
depth cues, by either using head-tracking to display two views generated according 
to head/eye position of the viewer or displaying multiple fixed views. In the former, 
the need for head-tracking, real-time view generation, and dynamic optics to steer 
two views in the direction of the viewer gaze increases hardware complexity. In the 
latter, continuous-motion parallax is not possible with a limited number of views, 
and proper 3D vision is only possible from some select viewing positions, called 
sweet spots. In order to determine the number of views, we divide the head-motion 
range into 2 cm intervals (zones) and present a view for each zone. Then, images seen 
by the left and right eyes (separated by 6 cm) will be separated by three views. If we 
allow 4-5 cm head movement toward the left and right, then the viewing range can 
be covered by a total of eight or nine views. The major drawbacks of autostereoscopic 
multi-view displays are: i) multiple views are displayed over the same physical screen, 
sharing sub-pixels between views in a predetermined pattern, which results in loss of 
spatial resolution; ii) cross-talk between multiple views is unavoidable due to limita- 
tions of optics; and iii) there may be noticeable parallax jumps from view to view 
with a limited number of viewing zones. For these reasons, auto-stereoscopic
displays have not entered the mass consumer market yet.
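
The view-count arithmetic in the paragraph above can be reproduced with a few lines of code. This is only a sketch of one way to read the calculation: it assumes the total range to be covered is the eye separation plus the head movement allowed on each side, and that one more view than the number of zone-width steps is needed. The function name and parameters are hypothetical.

```python
import math


def views_needed(eye_separation_cm, head_move_cm, zone_cm):
    """Estimate the number of fixed views for a multi-view display.

    Assumption: the covered range spans the eye separation plus the
    allowed head movement to the left and to the right.
    """
    span_cm = eye_separation_cm + 2 * head_move_cm
    return math.ceil(span_cm / zone_cm) + 1


# Values from the text: 2 cm zones, eyes 6 cm apart, 4-5 cm head movement.
print(6 // 2)                  # views between the left-eye and right-eye images
print(views_needed(6, 4, 2))   # 4 cm of movement on each side
print(views_needed(6, 5, 2))   # 5 cm of movement on each side
```

With 4 cm and 5 cm of allowed movement the sketch gives eight and nine views, respectively, matching the range quoted in the text.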

State-of-the-art stereoscopic and auto-stereoscopic displays have been reviewed
in [Ure 11]. Detailed analysis of stereoscopic and auto-stereoscopic displays from a
signal-processing perspective and their quality profiles are provided in [Boe 13].


Light-Field and Holographic Displays

Super multi-view (SMV) displays can display up to hundreds of views of a scene 
taken from different angles (instead of just a right and left view) to create a see- 
around effect as the viewer slightly changes his/her viewing (gaze) angle. SMV 
displays employ more advanced optical technologies than just allocating certain 
sub-pixels to certain views [Ure 11]. The characteristic parameters of a light-field 
display are spatial, angular, and perceived depth resolution. If the number of views 
is sufficiently large such that viewing zones are less than 3 mm, two or more views 
can be displayed within each eye pupil to overcome the accommodation-vergence 
conflict and offer a real 3D viewing experience. Quality measures for 3D light-field 
displays have been studied in [Kov 14]. 

Holographic imaging requires capturing amplitude (intensity), phase differences 
(interference pattern), and wavelength (color) of a light field using a coherent light 
source (laser). Holoscopic imaging (or integral imaging) does not require a coherent 
light source, but employs an array of microlenses to capture and reproduce a 4D 
light field, where each lens shows a different view depending on the viewing angle. 


Volumetric Displays 

Different volumetric display technologies aim at creating a 3D viewing experience 
by means of rendering illumination within a volume that is visible to the unaided 
eye either directly from the source or via an intermediate surface such as a mir- 
ror or glass, which can undergo motion such as oscillation or rotation. They can 
be broadly classified as swept-volume displays and static volume displays. Swept- 
volume 3D displays rely on the persistence of human vision to fuse a series of slices 
of a 3D object, which can be rectangular, disc-shaped, or helical cross-sectioned, into 
a single 3D image. Static-volume 3D displays partition a finite volume into address- 
able volume elements, called voxels, made out of active elements that are transparent 
in “off” state but are either opaque or luminous in “on” state. The resolution of a
volumetric display is determined by the number of voxels. It is possible to display 
scenes with viewing-position-dependent effects (e.g., occlusion) by including trans- 
parency (alpha) values for voxels. However, in this case, the scene may look distorted 
if viewed from positions other than those it was generated for. 

The light-field, volumetric, and holographic display technologies are still being 
developed in major research laboratories around the world and cannot be considered 
as mature technologies at the time of writing. Note that light-field and volumetric- 
video representations require orders of magnitude more data (and transmission 
bandwidth) compared to stereoscopic video. In the following, we cover representations
for two-view, multi-view, and super multi-view video.


2.3.2 Stereoscopic Video


Stereoscopic two-view video formats can be classified as frame-compatible and
full-resolution formats.

Frame-compatible stereo-video formats have been developed to provide 3DTV
services over existing digital TV broadcast infrastructures. They employ pixel
subsampling in order to keep the frame size and rate the same as that of monocular 2D
video. Common sub-sampling patterns include side-by-side, top-and-bottom, line 
interleaved, and checkerboard. Side-by-side format, shown in Figure 2.9(a), applies 
horizontal subsampling to the left and right views, reducing horizontal resolution 
by 50%. The subsampled frames are then put together side-by-side. Likewise, top- 
and-bottom format, shown in Figure 2.9(b), vertically subsamples the left and right 
views, and stitches them over-under. In the line-interleaved format, the left and right 
views are again sub-sampled vertically, but put together in an interleaved fashion. 
Checkerboard format sub-samples left and right views in an offset grid pattern and 
multiplexes them into a single frame in a checkerboard layout. Among these formats, 
side-by-side and top-and-bottom are selected as mandatory for broadcast by the lat- 
est HDMI specification 1.4a [HDM 13]. Frame-compatible formats are also sup- 
ported by the stereo and multi-view extensions of the most recent joint MPEG and 
ITU video-compression standards such as AVC and HEVC (see Chapter 8). 
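
The four sub-sampling patterns described above can be illustrated on small labeled frames. The following is a minimal sketch, not a broadcast implementation: frames are plain lists of rows, sub-sampling simply keeps every other sample (a real system would low-pass filter first), and the choice of which rows or pixels come from which view is an assumption made for illustration.

```python
def side_by_side(left, right):
    """Halve horizontal resolution and place left and right halves side by side."""
    return [l_row[0::2] + r_row[0::2] for l_row, r_row in zip(left, right)]


def top_and_bottom(left, right):
    """Halve vertical resolution and stack the left view over the right view."""
    return left[0::2] + right[0::2]


def line_interleaved(left, right):
    """Take even rows from the left view and odd rows from the right view."""
    return [left[y] if y % 2 == 0 else right[y] for y in range(len(left))]


def checkerboard(left, right):
    """Alternate left and right pixels on an offset (checkerboard) grid."""
    return [[left[y][x] if (x + y) % 2 == 0 else right[y][x]
             for x in range(len(left[y]))]
            for y in range(len(left))]


def make_view(tag, rows=4, cols=4):
    """Build a toy frame whose samples are labeled by view, row, and column."""
    return [[f"{tag}{y}{x}" for x in range(cols)] for y in range(rows)]


left, right = make_view("L"), make_view("R")
for name, pack in [("side-by-side", side_by_side),
                   ("top-and-bottom", top_and_bottom),
                   ("line-interleaved", line_interleaved),
                   ("checkerboard", checkerboard)]:
    print(name)
    for row in pack(left, right):
        print("  ", " ".join(row))
```

In every case the packed frame has the same size as a single input view, which is what allows it to pass through an existing 2D broadcast chain unchanged.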

The two-view full resolution stereo is the format of choice for movie and game 
content. Frame packing, which is a supported format in the HDMI specification 
version 1.4a, stores frames of left and right views sequentially, without any change 
in resolution. This full HD stereo-video format requires, in the worst case, twice 
as much bandwidth as that of monocular video. The extra bandwidth requirement 
may be kept around 50% by using the Multi-View Video Coding (MVC) standard, 
which is selected by the Blu-ray Disc Association as the coding format for 3D video. 


Figure 2.9 Frame compatible formats: (a) side-by-side; (b) top-bottom.


2.3.3 Multi-View Video


Multi-view and super multi-view displays employ multi-view video representations
with varying number of views. Since the required data rate increases linearly
with the number of views, depth-based representations are more efficient for
multi-view video with more than a few views. Depth-based representations also
enable: i) generation of desired intermediate views that are not present among the 
original views by using depth-image based rendering (DIBR) techniques, and ii) easy 
manipulation of depth effects to adjust vergence vs. accommodation conflict for best 
viewing comfort. 

View-plus-depth has initially been proposed as a stereo-video format, where a 
single view and associated depth map are transmitted to render a stereo pair at the 
decoder. It is backward compatible with legacy video using a layered bit stream with 
an encoded view and encoded depth map as a supplementary layer. MPEG speci- 
fied a container format for view-plus-depth data, called MPEG-C Part 3 [MPG 07], 
which was later extended to multi-view-video-plus-depth (MVD) format [Smo 11], 
where N views and N depth maps are encoded and transmitted to generate M views 
at the decoder, with N ≤ M. The MVD format is illustrated in Figure 2.10, where
only 6 views and 6 depth maps per frame are encoded to reconstruct 45 views per 
frame at the decoder side by using DIBR techniques. 

The depth information needs to be accurately captured/computed, encoded, and 
transmitted in order to render intermediate views accurately using the received refer- 
ence view and depth map. Each frame of the depth map conveys the distance of the 
corresponding video pixel from the camera. Scaled depth values, represented by 8 
bits, can be regarded as a separate gray-scale video, which can be compressed very 
efficiently using state-of-the-art video codecs. A depth map typically requires 15-20%
of the bitrate necessary to encode the original video due to its smooth and
less-structured nature.

Figure 2.10 N-view + N depth-map format (courtesy of Aljoscha Smolic).
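
To see why the depth-based representation pays off, the figures quoted in the text can be combined in a short calculation. This is a back-of-the-envelope sketch that assumes every view costs the same bitrate and ignores inter-view prediction; the function name is hypothetical, and rates are expressed in units of one encoded view.

```python
def mvd_rate(n_views, depth_fraction):
    """Rate of N views plus N depth maps, in units of one encoded view.

    depth_fraction is the cost of a depth map relative to its view
    (0.15 to 0.20 according to the text).
    """
    return n_views * (1.0 + depth_fraction)


# Example from the text: 6 views + 6 depth maps reconstruct 45 views.
for fraction in (0.15, 0.20):
    rate = mvd_rate(6, fraction)
    print(f"depth at {fraction:.0%}: {rate:.1f} view-equivalents vs. 45 views sent directly")
```

Under these simplifying assumptions, the 6-view MVD stream costs about 7 view-equivalents, a small fraction of what transmitting all 45 views independently would require.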

A difficulty with the view-plus-depth format is generation of accurate depth 
maps. Although there are time-of-flight cameras that can generate depth or dis- 
parity maps, they typically offer limited performance in outdoors environments. 
Algorithms for depth and disparity estimation by image rectification and disparity 
matching have been studied in the literature [Kau 07]. Another difficulty is the 
appearance of regions in the rendered views, which are occluded in the available 
views. These disocclusion regions may be concealed by smoothing the original depth- 
map data to avoid appearance of holes. Also, it is possible to use multiple view-plus- 
depth data to prevent disocclusions [Mul 11]. An extension of the view-plus-depth, 
which allows better modeling of occlusions, is the layered depth video (LDV). LDV 
provides multiple depth values for each pixel in a video frame. 

While high-definition digital-video products have gained universal user accep- 
tance, there are a number of challenges to overcome in bringing 3D video to con- 
sumers. Most importantly, advances in autostereoscopic (without glasses) multi-view 
display technology will be critical for practical usability and consumer acceptance of 
3D viewing technology. Availability of high-quality 3D content at home is another 
critical factor. In summary, both content creators and display manufacturers need 
further effort to provide consumers with a high-quality 3D experience without view- 
ing discomfort or fatigue and high transition costs. It seems that the TV/consumer 
electronics industry has moved its focus to bringing ultra-high-definition products 
to consumers until there is more progress with these challenges. 


2.4 Digital-Video Applications 


Main consumer applications for digital video include digital TV broadcasts, digital 
cinema, video playback from DVD or blu-ray players, as well as video streaming and 
videoconferencing over the Internet (wired or wireless) [Pit 13]. 


2.4.1 Digital TV


A digital TV (DTV) broadcasting system consists of video/audio compression, mul- 
tiplex and transport protocols, channel coding, and modulation subsystems. The 
biggest single innovation that enabled digital TV services has been advances in video 
compression since the 1990s. Video-compression standards and algorithms are cov- 
ered in detail in Chapter 8. Video and audio are compressed separately by different 
encoders to produce video and audio packetized elementary streams (PES). Video and
audio PES and related data are multiplexed into an MPEG program stream (PS).
Next, one or more PSs are multiplexed into an MPEG transport stream (TS). TS
packets are 188 bytes long and are designed with synchronization and recovery in
mind for transmission in lossy environments. The TS is then modulated into a signal
for transmission. Several different modulation methods exist that are specific to the
medium of transmission, which are terrestrial (fixed reception), cable, satellite, and
mobile reception.
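
The synchronization support built into TS packets can be made concrete by parsing a packet header. The sketch below relies on the header layout defined in the MPEG-2 Systems specification (ISO/IEC 13818-1), which is not detailed in this chapter: a sync byte of 0x47, a 13-bit packet identifier (PID), and a 4-bit continuity counter. The function name and returned dictionary are illustrative choices.

```python
TS_PACKET_SIZE = 188   # bytes, as stated in the text
SYNC_BYTE = 0x47       # from ISO/IEC 13818-1 (assumption: not given in the text)


def parse_ts_header(packet: bytes) -> dict:
    """Decode the main fields of the 4-byte transport stream packet header."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("a TS packet must be exactly 188 bytes long")
    if packet[0] != SYNC_BYTE:
        raise ValueError("lost synchronization: first byte is not 0x47")
    return {
        "transport_error": bool(packet[1] & 0x80),
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],
        "continuity_counter": packet[3] & 0x0F,
    }


# A null (stuffing) packet: PID 0x1FFF, payload only, all-zero payload.
null_packet = bytes([0x47, 0x1F, 0xFF, 0x10]) + bytes(184)
print(parse_ts_header(null_packet))
```

A receiver can regain alignment after data loss by searching for the sync byte at 188-byte intervals, and can detect missing packets of a given PID from gaps in the continuity counter.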

There are different digital TV broadcasting standards that are deployed globally. 
Although they all use MPEG-2 or MPEG-4 AVC/H.264 video compression, more 
or less similar audio coding, and the same transport stream protocol, their chan- 
nel coding, transmission bandwidth and modulation systems differ slightly. These 
include the Advanced Television Systems Committee (ATSC) in the USA, Digital
Video Broadcasting (DVB) in Europe, Integrated Services Digital Broadcasting (ISDB)
in Japan, and Digital Terrestrial Multimedia Broadcasting in China.


ATSC Standards

The first DTV standard was ATSC Standard A/53, which was published in 1995
and was adopted by the Federal Communications Commission in the United
States in 1996. This standard supported MPEG-2 Main profile video encoding 
and 5.1-channel surround sound using Dolby Digital AC-3 encoding, which was 
standardized as A/52. Support for AVC/H.264 video encoding was added with the 
ATSC Standard A/72 that was approved in 2008. ATSC signals are designed to use 
the same 6-MHz bandwidth as analog NTSC television channels. Once the digital
video and audio signals have been compressed and multiplexed, ATSC uses a 188- 
byte MPEG transport stream to encapsulate and carry several video and audio pro- 
grams and metadata. The transport stream is modulated differently depending on 
the method of transmission: 


• Terrestrial broadcasters use 8-VSB modulation that can transmit at a maximum
  rate of 19.39 Mbit/s. ATSC 8-VSB transmission system adds 20 bytes of Reed-
  Solomon forward-error correction to create packets that are 208 bytes long.

• Cable television stations operate at a higher signal-to-noise ratio than
  terrestrial broadcasters and can use either 16-VSB (defined by ATSC) or 256-QAM
  (defined by Society of Cable Telecommunications Engineers) modulation to
  achieve a throughput of 38.78 Mbit/s, using the same 6-MHz channel.

• There is also an ATSC standard for satellite transmission; however, direct-
  broadcast satellite systems in the United States and Canada have long used
  either DVB-S (in standard or modified form) or a proprietary system such as
  DSS (Hughes) or DigiCipher 2 (Motorola).
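
The ATSC numbers quoted in the list above are related by simple arithmetic, sketched below. The calculation assumes that the 19.39 Mbit/s figure refers to the rate of 188-byte transport stream packets before Reed-Solomon parity is added; that interpretation, and the variable names, are assumptions made for illustration.

```python
TS_PACKET_BYTES = 188          # MPEG transport stream packet size
RS_PARITY_BYTES = 20           # Reed-Solomon bytes added per packet
TERRESTRIAL_RATE = 19.39e6     # bit/s, 8-VSB
CABLE_RATE = 38.78e6           # bit/s, 16-VSB or 256-QAM

coded_packet_bytes = TS_PACKET_BYTES + RS_PARITY_BYTES
rs_overhead = RS_PARITY_BYTES / coded_packet_bytes
packets_per_second = TERRESTRIAL_RATE / (TS_PACKET_BYTES * 8)
cable_gain = CABLE_RATE / TERRESTRIAL_RATE

print(f"coded packet length: {coded_packet_bytes} bytes")
print(f"Reed-Solomon share of coded packet: {rs_overhead:.1%}")
print(f"TS packets per second (terrestrial): {packets_per_second:.0f}")
print(f"cable vs. terrestrial throughput: {cable_gain:.1f}x")
```

The sketch confirms the 208-byte coded packet length and shows that the cable mode carries twice the terrestrial payload in the same 6-MHz channel.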


The receiver must demodulate and apply error correction to the signal. Then, the 
transport stream may be de-multiplexed into its constituent streams before audio 
and video decoding. 

The newest edition of the standard is ATSC 3.0, which adopts the HEVC/H.265
video codec and uses OFDM instead of 8-VSB for terrestrial modulation, allowing for
28 Mbps or more of bandwidth on a single 6-MHz channel.


DVB Standards 


DVB is a suite of standards, adopted by the European Telecommunications Stan- 
dards Institute (ETSI) and supported by European Broadcasting Union (EBU), 
which defines the physical layer and data-link layer of the distribution system. The 
DVB texts are available on the ETSI website. They are specific for each medium of 
transmission, which we briefly review. 


DVB-T and DVB-T2 


DVB-T is the DVB standard for terrestrial broadcast of digital television and was first 
published in 1997. It specifies transmission of MPEG transport streams, containing 
MPEG-2 or H.264/MPEG-4 AVC compressed video, MPEG-2 or Dolby Digital 
AC-3 audio, and related data, using coded orthogonal frequency-division multiplex- 
ing (COFDM) or OFDM modulation. Rather than carrying data on a single radio 
frequency (RF) channel, COFDM splits the digital data stream into a large number 
of lower rate streams, each of which digitally modulates a set of closely spaced adjacent 
sub-carrier frequencies. There are two modes: 2K-mode (1,705 sub-carriers that are 
4 kHz apart) and 8K-mode (6,817 sub-carriers that are 1 kHz apart). DVB-T offers 
three different modulation schemes (QPSK, 16QAM, 64QAM). It was intended for 
DTV broadcasting using mainly VHF 7 MHz and UHF 8 MHz channels. The first 
DVB-T broadcast was realized in the UK in 1998. The DVB-T2 is the extension 
of DVB-T that was published in June 2008. With several technical improvements, 
it provides a minimum 30% increase in payload, under similar channel conditions,
compared to DVB-T. The ETSI adopted the DVB-T2 in September 2009.
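
The two COFDM modes trade the number of sub-carriers against their spacing while occupying about the same spectrum, as the short sketch below illustrates. It uses the rounded spacings quoted in the text (4 kHz and 1 kHz) and assumes the occupied bandwidth is the distance between the first and last sub-carrier; the bits-per-carrier values follow from the constellation sizes of QPSK, 16QAM, and 64QAM.

```python
def occupied_bandwidth_khz(n_subcarriers, spacing_khz):
    """Distance between the first and last sub-carrier, in kHz."""
    return (n_subcarriers - 1) * spacing_khz


# Sub-carrier counts and rounded spacings as quoted in the text.
modes = {"2K": (1705, 4), "8K": (6817, 1)}
for name, (count, spacing) in modes.items():
    print(f"{name}-mode: {occupied_bandwidth_khz(count, spacing)} kHz")

# Bits carried by each sub-carrier per symbol: log2 of the constellation size.
constellations = {"QPSK": 4, "16QAM": 16, "64QAM": 64}
for name, points in constellations.items():
    print(f"{name}: {points.bit_length() - 1} bits per sub-carrier symbol")
```

With the rounded spacings both modes span the same bandwidth, so the choice between them affects symbol duration and robustness to multipath and Doppler rather than the channel width.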


DVB-S and DVB-S2 


DVB-S is the original DVB standard for satellite television. Its first release dates back 
to 1995, while development lasted until 1997. The standard only specifies physical
link characteristics and framing for delivery of MPEG transport stream (MPEG-TS)
containing MPEG-2 compressed video, MPEG-2 or Dolby Digital AC-3 audio, 
and related data. The first commercial application was in Australia, enabling digi- 
tally broadcast, satellite-delivered television to the public. DVB-S has been used in 
both multiple-channel per carrier and single-channel per carrier modes for broadcast 
network feeds and direct broadcast satellite services in every continent of the world, 
including Europe, the United States, and Canada. 

DVB-S2 is the successor of the DVB-S standard. It was developed in 2003 and 
ratified by the ETSI in March 2005. DVB-S2 supports broadcast services including 
standard and HDTV, interactive services including Internet access, and professional
data content distribution. The development of DVB-S2 coincided with the intro- 
duction of HDTV and H.264 (MPEG-4 AVC) video codecs. Two new key features 
that were added compared to the DVB-S standard are: 


• A powerful coding scheme, Irregular Repeat-Accumulate codes, based on a
  modern LDPC code, with a special structure for low encoding complexity.

• Variable coding and modulation (VCM) and adaptive coding and modulation
  (ACM) modes to optimize bandwidth utilization by dynamically changing
  transmission parameters.


Other features include enhanced modulation schemes up to 32-APSK, addi- 
tional code rates, and introduction of a generic transport mechanism for IP packet 
data including MPEG-4 AVC video and audio streams, while supporting backward 
compatibility with existing DVB-S transmission. The measured DVB-S2 perfor- 
mance gain over DVB-S is around a 30% increase of available bitrate at the same 
satellite transponder bandwidth and emitted signal power. With improvements in 
video compression, an MPEG-4 AVC HDTV service can now be delivered in the 
same bandwidth used for an early DVB-S based MPEG-2 SDTV service. In March 
2014, the DVB-S2X specification was published as an optional extension adding 
further improvements. 


DVB-C and DVB-C2 


The DVB-C standard is for broadcast transmission of digital television over cable. 
This system transmits an MPEG-2 or MPEG-4 family of digital audio/digital video 
stream using QAM modulation with channel coding. The standard was first pub- 
lished by the ETSI in 1994, and became the most widely used transmission system 
for digital cable television in Europe. It is deployed worldwide in systems ranging
from larger cable television networks (CATV) to smaller satellite master antenna TV
(SMATV) systems. 

The second-generation DVB cable transmission system DVB-C2 specification 
was approved in April 2009. DVB-C2 allows bitrates up to 83.1 Mbit/s on an 8 
MHz channel when using 4096-QAM modulation, and up to 97 Mbit/s and 110.8 
Mbit/s per channel when using 16384-QAM and 65536-AQAM modulation, 
respectively. By using state-of-the-art coding and modulation techniques, DVB-C2 
offers more than a 30% higher spectrum efficiency under the same conditions, and 
the gains in downstream channel capacity are greater than 60% for optimized HFC 
networks. These results show that the performance of the DVB-C2 system gets so 
close to the theoretical Shannon limit that any further improvements would most 
likely not be able to justify the introduction of a disruptive third generation cable- 
transmission system. 
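
The DVB-C2 bitrates quoted above can be related to the constellation sizes with a short calculation. The sketch assumes that all three bitrates refer to an 8 MHz channel and treats each modulation simply as a constellation with the stated number of points; both are assumptions made for illustration.

```python
CHANNEL_MHZ = 8.0  # assumed for all three modes

# Constellation size -> quoted bitrate in Mbit/s (values from the text).
quoted_rates = {4096: 83.1, 16384: 97.0, 65536: 110.8}

for points, rate_mbps in quoted_rates.items():
    bits_per_symbol = points.bit_length() - 1        # log2 of a power of two
    spectral_efficiency = rate_mbps / CHANNEL_MHZ    # bit/s per Hz
    rate_per_bit = rate_mbps / bits_per_symbol       # Mbit/s per constellation bit
    print(f"{points:>5}-point: {bits_per_symbol} bits/symbol, "
          f"{spectral_efficiency:.2f} bit/s/Hz, "
          f"{rate_per_bit:.3f} Mbit/s per bit")
```

The rate per constellation bit is nearly identical in the three cases (about 6.9 Mbit/s), which suggests that the quoted bitrates scale directly with the number of bits per symbol at a fixed symbol rate and coding overhead.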

There is also a DVB-H standard for terrestrial mobile TV broadcasting to hand- 
held devices. The competitors of this technology have been the 3G cellular-system- 
based MBMS mobile-TV standard, the ATSC-M/H format in the United States, 
and the Qualcomm MediaFLO. DVB-SH (satellite to handhelds) and DVB-NGH 
(Next Generation Handheld) are possible future enhancements to DVB-H. However,
none of these technologies have been commercially successful.


2.4.2 Digital Cinema 


Digital cinema refers to digital distribution and projection of motion pictures as 
opposed to use of motion picture film. A digital cinema theatre requires a digital pro- 
jector (instead of a conventional film projector) and a special computer server. Mov- 
ies are supplied to theatres as digital files, called a Digital Cinema Package (DCP), 
whose size is between 90 gigabytes (GB) and 300 GB for a typical feature movie. The 
DCP may be physically delivered on a hard drive or can be downloaded via satellite. 
The encrypted DCP file first needs to be copied onto the server. The decryption keys, 
which expire at the end of the agreed upon screening period, are supplied separately 
by the distributor. The keys are locked to the server and projector that will screen the 
film; hence, a new set of keys is required to show the movie on another screen. The
playback of the content is controlled by the server using a playlist. 


Technology and Standards 


Digital cinema projection was first demonstrated in the United States in October 
1998 using Texas Instruments’ DLP projection technology. In January 2000, the
Society of Motion Picture and Television Engineers, in North America, initiated a
group to develop digital cinema standards. Digital Cinema Initiatives (DCI), a
joint venture of six major studios, was established in March 2002 to develop a system 
specification for digital cinema to provide robust intellectual property protection 
for content providers. DCI published the first version of a specification for digital 
cinema in July 2005. Any DCI-compliant content can play on any DCI-compliant 
hardware anywhere in the world. 

Digital cinema uses high-definition video standards, aspect ratios, or frame rates 
that are slightly different than HDTV and UHDTV. The DCI specification supports
2K (2048 × 1080 or 2.2 Mpixels) at 24 or 48 frames/sec and 4K (4096 × 2160
or 8.8 Mpixels) at 24 frames/sec modes, where resolutions are represented by the
horizontal pixel count. The 48 frames/sec is called high frame rate (HFR). The speci- 
fication employs the ISO/IEC 15444-1 JPEG2000 standard for picture encoding, 
and the CIE XYZ color space is used at 12 bits per component encoded with a 2.6 
gamma applied at projection. It ensures that 2K content can play on 4K projectors 
and vice versa. 
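
The pixel counts quoted for the two DCI modes follow directly from the frame dimensions, as the short sketch below shows. The dictionary and function names are illustrative.

```python
DCI_MODES = {"2K": (2048, 1080), "4K": (4096, 2160)}


def megapixels(width, height):
    """Return the number of pixels per frame in millions."""
    return width * height / 1e6


for name, (width, height) in DCI_MODES.items():
    print(f"{name}: {width} x {height} = {megapixels(width, height):.1f} Mpixels")

ratio = (4096 * 2160) // (2048 * 1080)
print(f"4K frames hold {ratio} times as many pixels as 2K frames")
```

Because both dimensions double, a 4K frame holds four times as many pixels as a 2K frame.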


Digital Cinema Projectors 


Digital cinema projectors are similar in principle to other digital projectors used in 
the industry. However, they must be approved by the DCI for compliance with the 
DCI specifications: i) they must conform to the strict performance requirements, 
and ii) they must incorporate anti-piracy protection to protect copyrights. Major 
DCI-approved digital cinema projector manufacturers include Christie, Barco,
NEC, and Sony. The first three manufacturers have licensed the DLP technology
from Texas Instruments, and Sony uses its own SXRD technology. DLP projectors 
were initially available in 2K mode only. DLP projectors became available in both 
2K and 4K in early 2012, when Texas Instruments’ 4K DLP chip was launched. 
Sony SXRD projectors are only manufactured in 4K mode. 

DLP technology is based on digital micromirror devices (DMDs), which are 
chips whose surface is covered by a large number of microscopic mirrors, one for 
each pixel; hence, a 2K chip has about 2.2 million mirrors and a 4K chip about 8.8 
million. Each mirror vibrates several thousand times a second between on and off 
positions. The proportion of the time the mirror is in each position varies according 
to the brightness of each pixel. Three DMD devices are used for color projection, 
one for each of the primary colors. Light from a Xenon lamp, with power between 
1 kW and 7 kW, is split by color filters into red, green, and blue beams that are 
directed at the appropriate DMD.

Transition to digital projection in cinemas is ongoing worldwide. According to
the National Association of Theatre Owners, 37,711 screens out of 40,048 in the 
United States had been converted to digital and about 15,000 were 3D capable as 
of May 2014. 


3D Digital Cinema 


The number of 3D-capable digital cinema theatres is increasing with wide interest of 
audiences in 3D movies and an increasing number of 3D productions. A 3D-capable 
digital cinema video projector projects right-eye and left-eye frames sequentially. The 
source video is produced at 24 frames/sec per eye; hence, a total of 48 frames/sec 
for right and left eyes. Each frame is projected three times to reduce flicker, called 
triple flash, for a total of 144 times per second. A silver screen is used to maintain 
light polarization upon reflection. There are two types of stereoscopic 3D viewing 
technology where each eye sees only its designated frame: i) glasses with polarizing 
filters oriented to match projector filters, and ii) glasses with liquid crystal (LCD) 
shutters that block or transmit light in sync with the projectors. These technologies 
are provided under the brands RealD, MasterImage, Dolby 3D, and XpanD.
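
The triple-flash arithmetic in this paragraph can be written out as a small worked example. The function names are illustrative; the default values are the ones quoted in the text.

```python
def flashes_per_second(fps_per_eye=24, eyes=2, flashes_per_frame=3):
    """Total number of image flashes per second reaching the screen."""
    return fps_per_eye * eyes * flashes_per_frame


def flash_duration_ms(fps_per_eye=24, eyes=2, flashes_per_frame=3):
    """Time slot available to each flash, in milliseconds."""
    return 1000.0 / flashes_per_second(fps_per_eye, eyes, flashes_per_frame)


print(flashes_per_second())           # 24 frames/sec per eye x 2 eyes x 3 flashes
print(round(flash_duration_ms(), 2))  # duration of one flash in ms
```

Each flash therefore occupies a time slot of a little under 7 ms, which is the switching interval that the polarization modulator or shutter glasses have to follow.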

The polarization technology combines a single 144-Hz digital projector with 
either a polarizing filter (for use with polarized glasses and silver screens) or a filter 
wheel. RealD 3D cinema technology places a push-pull electro-optical liquid crystal 
modulator called a ZScreen in front of the projector lens to alternately polarize each 
frame. It circularly polarizes frames clockwise for the right eye and counter-clockwise 
for the left eye. MasterImage uses a filter wheel that changes the polarity of the projector’s
light output several times per second to alternate the left-and-right-eye views.
Dolby 3D also uses a filter wheel. The wheel changes the wavelengths of colors being 
displayed, and tinted glasses filter these changes so the incorrect wavelength cannot 
enter the wrong eye. The advantage of circular polarization over linear polarization 
is that viewers are able to slightly tilt their head without seeing double or darkened 
images. 

The XpanD system alternately flashes the images for each eye, which viewers observe
using electronically synchronized glasses. The viewer wears electronic glasses whose
LCD lenses alternate between clear and opaque to show only the correct image at the
correct time for each eye. XpanD uses an external emitter that broadcasts an invis- 
ible infrared signal in the auditorium that is picked up by glasses to synchronize the 
shutter effect. 

IMAX Digital 3D uses two separate 2K projectors that represent the left and right 
eyes. They are separated by a distance of 64 mm (2.5 in), which is the average distance
between a human’s eyes. The two 2K images are projected over each other (superposed)
on a silver screen with proper polarization, which makes the image brighter.
Right and left frames on the screen are directed only to the correct eye by means of 
polarized glasses that enable the viewer to see in 3D. Note that IMAX theatres use 
the original 15/70 IMAX higher resolution frame format on larger screens. 


2.4.3 Video Streaming over the Internet 


Video streaming refers to delivery of media over the Internet, where the client player 
can begin playback before the entire file has been sent by the server. A server-client 
streaming system consists of a streaming server and a client that communicate using 
a set of standard protocols. The client may be a standalone player or a plugin as 
part of a Web browser. The streaming session can be a video-on-demand request 
(sometimes called a pull-application) or live Internet broadcasting (called a push- 
application). In a video-on-demand session, the server streams from a pre-encoded 
and stored file. Live streaming refers to live content delivered in real-time over the 
Internet, which requires a live camera and a real-time encoder on the server side. 

Since the Internet is a best-effort channel, packets may be delayed or dropped by 
the routers and the effective end-to-end bitrates fluctuate in time. Adaptive stream- 
ing technologies aim to adapt the video-source (encoding) rate according to an esti- 
mate of the available end-to-end network rate. One possible way to do this is stream 
switching, where the server encodes source video at multiple pre-selected bitrates and 
the client requests switching to the stream encoded at the rate that is closest to its 
network access rate. A less commonly deployed solution is based on scalable video 
coding, where one or more enhancement layers of video may be dropped to reduce 
the bitrate as needed. 
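
The stream-switching idea described above can be sketched as a simple selection rule. The version below assumes the client picks the highest pre-encoded rate that does not exceed its rate estimate, and falls back to the lowest rate otherwise; this is one common reading of "closest" and not the only possible policy. The bitrate ladder is hypothetical.

```python
def select_bitrate(available_kbps, estimated_rate_kbps):
    """Pick a pre-encoded stream for the estimated network rate.

    Assumed policy: the highest encoded rate that still fits within the
    estimate; if none fits, the lowest available rate is requested.
    """
    fitting = [rate for rate in available_kbps if rate <= estimated_rate_kbps]
    if not fitting:
        return min(available_kbps)
    return max(fitting)


# Hypothetical set of bitrates at which the server has encoded the source.
LADDER_KBPS = [400, 800, 1500, 3000]

for estimate in (250, 1000, 2900, 6000):
    print(estimate, "->", select_bitrate(LADDER_KBPS, estimate))
```

Choosing a rate below the estimate, and not merely the nearest one, leaves some headroom so that the client buffer does not drain when the estimate is slightly optimistic.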

In the server-client model, the server sends a different stream to each client. 
This model is not scalable, since server load increases linearly with the number of 
stream requests. Two solutions to solve this problem are multicasting and peer-to- 
peer (P2P) streaming. We discuss the server-client, multicast, and P2P streaming 
models in more detail below. 


Server-Client Streaming 


This is the most commonly used streaming model on the Internet today. All video 
streaming systems deliver video and audio streams by using a streaming protocol 
built on top of transmission control protocol (TCP) or user datagram protocol 
(UDP). Streaming solutions may be based on open-standard protocols published by
the Internet Engineering Task Force (IETF) such as RTP/UDP or HTTP/TCP, or
may be proprietary systems, where RTP stands for real-time transport protocol and 
HTTP stands for hyper-text transfer protocol. 


Streaming Protocols 


Two popular streaming protocols are Real-Time Streaming Protocol (RTSP), an 
open standard developed and published by the IETF as RFC 2326 in 1998, and 
Real Time Messaging Protocol (RTMP), a proprietary solution developed by Adobe 
Systems. 

RTSP servers use the Real-time Transport Protocol (RTP) for media stream 
delivery, which supports a range of media formats (such as AVC/H.264, MJPEG, 
etc.). Client applications include QuickTime, Skype, and Windows Media Player. 
Android smartphone platforms also include support for RTSP as part of the 3GPP 
standard. 

RTMP is primarily used to stream audio and video to Adobe’s Flash Player client. 
The majority of streaming videos on the Internet is currently delivered via RTMP 
or one of its variants due to the success of the Flash Player. RTMP has been released 
for public use. Adobe has included support for adaptive streaming into the RTMP 
protocol. 

The main problem with UDP-based streaming is that streams are frequently 
blocked by firewalls, since they are not being sent over HTTP (port 80). In order 
to circumvent this problem, protocols have been extended to allow for a stream to 
be encapsulated within HTTP requests, which is called tunneling. However, tun- 
neling comes at a performance cost and is often only deployed as a fallback solu- 
tion. Streaming protocols also have secure variants that use encryption to protect 
the stream. 


HTTP Streaming 


Streaming over HTTP, which is a more recent technology, works by breaking a 
stream into a sequence of small HTTP-based file downloads, where each down- 
load loads one short chunk of the whole stream. All flavors of HTTP streaming 
include support for adaptive streaming (bitrate switching), which allows clients to 
dynamically switch between different streams of varying quality and chunk size dur- 
ing playback, in order to adapt to changing network conditions and available CPU 
resources. By using HTTP, firewall issues are generally avoided. Another advantage 
of HTTP streaming is that it allows HTTP chunks to be cached within ISPs or
corporations, which would reduce the bandwidth required to deliver HTTP streams,
in contrast to video streamed via RTMP. 
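
The chunk-by-chunk adaptation described above can be illustrated with a small simulation that involves no real network access. The sketch assumes the client measures the throughput of each chunk download, smooths it with an exponentially weighted moving average (the smoothing factor `alpha` is an assumption), and requests the next chunk at the highest bitrate that fits the smoothed estimate. None of this is taken from a specific vendor protocol.

```python
def simulate_http_session(measured_kbps, ladder_kbps, alpha=0.5):
    """Return the bitrate chosen for each chunk of a simulated session.

    measured_kbps[i] is the throughput observed while downloading chunk i.
    The first chunk is requested at the lowest bitrate, since no estimate
    exists yet.
    """
    ladder = sorted(ladder_kbps)
    estimate = None
    choices = []
    for throughput in measured_kbps:
        if estimate is None:
            choice = ladder[0]
        else:
            fitting = [rate for rate in ladder if rate <= estimate]
            choice = fitting[-1] if fitting else ladder[0]
        choices.append(choice)
        # Each finished chunk download yields a new throughput sample.
        if estimate is None:
            estimate = throughput
        else:
            estimate = alpha * throughput + (1 - alpha) * estimate
    return choices


# Hypothetical session: the network rate drops after the second chunk.
print(simulate_http_session([2000, 2000, 500, 500], [400, 800, 1500, 3000]))
```

Because every chunk is an ordinary HTTP download, the only state the client has to keep in this sketch is the running throughput estimate.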

Different vendors have implemented different HTTP-based streaming solutions,
which all use similar mechanisms but are incompatible; hence, they all require the 
vendor's own software: 


• HTTP Live Streaming (HLS) by Apple is an HTTP-based media streaming
protocol that can dynamically adjust movie playback quality to match the available
speed of wired or wireless networks. HTTP Live Streaming can deliver
streaming media to an iOS app or HTML5-based website. It is available as an
IETF Draft (as of October 2014) [Pan 14].

• Smooth Streaming by Microsoft enables adaptive streaming of media to clients
over HTTP. The format specification is based on the ISO base media file format.
Microsoft provides Smooth Streaming Client software development kits
for Silverlight and Windows Phone 7.

• HTTP Dynamic Streaming (HDS) by Adobe provides HTTP-based adaptive
streaming of high-quality AVC/H.264 or VP6 video for a Flash Player client
platform.


MPEG-DASH is the first adaptive bit-rate HTTP-based streaming solution 
that is an international standard, published in April 2012. MPEG-DASH is audio/ 
video codec agnostic. It allows devices such as Internet-connected televisions, TV 
set-top boxes, desktop computers, smartphones, tablets, etc., to consume mul- 
timedia delivered via the Internet using previously existing HTTP web server 
infrastructure, with the help of adaptive streaming technology. Standardizing an 
adaptive streaming solution aims to provide confidence that the solution can be 
adopted for universal deployment, compared to similar proprietary solutions such 
as HLS by Apple, Smooth Streaming by Microsoft, or HDS by Adobe. An imple- 
mentation of MPEG-DASH using a content centric networking (CCN) naming 
scheme to identify content segments is publicly available [Led 13]. Several issues 
still need to be resolved, including legal patent claims, before DASH can become 
a widely used standard. 


Multicast and Peer-to-Peer (P2P) Streaming 


Multicast is a one-to-many delivery system, where the source server sends each packet 
only once, and the nodes in the network replicate packets only when necessary to 
reach multiple clients. The client nodes send join and leave messages, e.g., as in the
case of Internet television when the user changes the TV channel. In P2P streaming,
clients (peers) forward packets to other peers (as opposed to network nodes) to
minimize the load on the source server. 

The multicast concept can be implemented at the IP or application level. The 
most common transport layer protocol to use multicast addressing is the User Data- 
gram Protocol (UDP). IP multicast is implemented at the IP routing level, where 
routers create optimal distribution paths for datagrams sent to a multicast destina- 
tion address. IP multicast has been deployed in enterprise networks and multimedia 
content delivery networks, e.g., in IPTV applications. However, IP multicast is not 
implemented in commercial Internet backbones mainly due to economic reasons. 
Instead, application layer multicast-over-unicast overlay services for application-level 
group communication are widely used. 

In media streaming over P2P overlay networks, each peer forwards packets to 
other peers in a live media streaming session to minimize the load on the server. 
Several protocols that help peers find a relay peer for a specified stream exist [Gu 14]. 
There are P2PTV networks based on real-time versions of the popular file-sharing 
protocol BitTorrent. Some P2P technologies employ the multicast concept when 
distributing content to multiple recipients, which is known as peercasting. 


2.4.4 Computer Vision and Scene/Activity Understanding 


Computer vision is a discipline of computer science that aims to duplicate abilities 
of human vision by processing and understanding digital images and video. It is such 
a large field that it is the subject of many excellent textbooks [Har 04, For 11, Sze 
11]. The visual data to be processed can be still images, video sequences, or views 
from multiple cameras. Computer vision is generally divided into high-level and 
low-level vision. High-level vision is often considered as part of artificial intelligence 
and is concerned with the theory of learning and pattern recognition with applica- 
tion to object/activity recognition in order to extract information from images and 
video. We mention computer vision here because many of the problems addressed in 
image/video processing and low-level vision are common. Low-level vision includes 
many image- and video-processing tasks that are the subject of this book such as 
edge detection, image enhancement and restoration, motion estimation, 3D scene 
reconstruction, image segmentation, and video tracking. These low-level vision tasks 
have been used in many computer-vision applications, including road monitoring, 
military surveillance, and robot navigation. Indeed, several of the methods discussed 
in this book have been developed by computer-vision researchers.


2.5 Image and Video Quality


Video quality may be measured by the quality of experience of viewers, which can 
usually be reliably measured by subjective methods. There have been many studies to 
develop objective measures of video quality that correlate well with subjective evalu- 
ation results [Cho 14, Bov 13]. However, this is still an active research area. Since 
analog video is becoming obsolete, we start by defining some visual artifacts related 
to digital video that are the main cause of loss of quality of experience. 


2.5.1 Visual Artifacts 


Artifacts are visible distortions in images/videos. We can classify visual artifacts as 
spatial and temporal artifacts. Spatial artifacts, such as blur, noise, ringing, and block- 
ing, are most disturbing in still images but may also be visible in video. In addition, 
in video, temporal freeze and skipped frames are important causes of visual distur- 
bance and, hence, loss of quality of experience. 

Blur refers to lack or loss of image sharpness (high spatial frequencies). The main 
causes of blur are insufficient spatial resolution, defocus, and/or motion between 
camera and the subject. According to the Nyquist sampling theorem, the highest 
horizontal and vertical spatial frequencies that can be represented are determined by
the sampling rate (pixels/cm), which relates to image resolution. Consequently, low- 
resolution images cannot contain high spatial frequencies and appear blurred. Defo- 
cus blur is due to incorrect focus of the camera, which may be due to depth of field. 
Motion blur is caused by relative movement of the subject and camera while the 
shutter is open. It may be more noticeable in imaging darker scenes since the shutter 
has to remain open for a longer time.

Image noise refers to low amplitude, high-frequency random fluctuations in the 
pixel values of recorded images. It is an undesirable by-product of image capture, 
which can be produced by film grain, photo-electric sensors, and digital camera 
circuitry, or image compression. It is measured by signal-to-noise ratio. Noise due 
to electronic fluctuations can be modeled by a white, Gaussian random field, while
noise due to CCD sensor imperfections is usually modeled as impulsive (salt-and-pepper)
noise. Noise at low-light (signal) levels can be modeled as speckle noise.
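
The two noise models named above can be simulated on a small test image to see how they differ. The sketch below assumes 8-bit pixel values stored as nested Python lists; the function names, the clipping to [0, 255], and the even split between "salt" and "pepper" are assumptions made for illustration.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean white Gaussian noise and clip to the 8-bit range."""
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in image]


def add_salt_and_pepper(image, density, rng):
    """Replace a fraction `density` of pixels by 0 (pepper) or 255 (salt)."""
    noisy = []
    for row in image:
        new_row = []
        for p in row:
            u = rng.random()
            if u < density / 2:
                new_row.append(0)      # pepper
            elif u < density:
                new_row.append(255)    # salt
            else:
                new_row.append(p)      # pixel left untouched
        noisy.append(new_row)
    return noisy


rng = random.Random(0)                  # fixed seed for repeatable runs
flat = [[128] * 8 for _ in range(8)]    # flat mid-gray test image
print(add_gaussian_noise(flat, 10.0, rng)[0])
print(add_salt_and_pepper(flat, 0.2, rng)[0])
```

Gaussian noise perturbs every pixel by a small amount, whereas impulsive noise leaves most pixels intact and drives a few to the extremes, which is why the two are usually handled by different filters.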

Image/video compression also generates noise, known as quantization noise and 
mosquito noise. Quantization or truncation of the DCT/wavelet transform coeffi- 
cients results in quantization noise. Mosquito noise is temporal noise, i.e., flickering- 
like luminance/chrominance fluctuations as a consequence of differences in coding 
observed in smoothly textured regions or around high contrast edges in consecutive 
frames of video.

Ringing and blocking artifacts, which are by-products of DCT image/video
compression, are also observed in compressed images/video. Ringing refers to oscillations
around sharp edges. It is caused by sudden truncation of DCT coefficients
due to coarse quantization and is also known as the Gibbs effect. DCT is usually taken
over 8 × 8 blocks. Coarse quantization of DC coefficients may cause mismatch of
image mean over 8 × 8 blocks, which results in visible block boundaries known as
blocking artifacts.

Skip frame and freeze frame are the result of video transmission over unreliable 
channels. They are caused by video packets that are not delivered on time. When 
video packets are late, there are two options: skip late packets and continue with the 
next packet, which is delivered on time, or wait (freeze) until the late packets arrive. 
Skipped frames result in motion jerkiness and discontinuity, while freeze frame refers 
to complete stopping of action until the video is rebuffered. 

Visibility of artifacts is affected by the viewing conditions, as well as the type of 
image/video content as a result of spatial and temporal-masking effects. For example, 
spatial-image artifacts that are not visible in full-motion video may be highly objectionable
when we freeze frame.


2.5.2 Subjective Quality Assessment 


Measurement of subjective video quality can be challenging because many param- 
eters of set-up and viewing conditions, such as room illumination, display type, 
brightness, contrast, resolution, viewing distance, and the age and educational level 
of experts, can influence the results. The selection of video content and the duration 
also affect the results. A typical subjective video quality evaluation procedure consists 
of the following steps: 


1. Choose video sequences for testing 

2. Choose the test set-up and settings of system to evaluate 

3. Choose a test method (how sequences are presented to experts and how their 
opinion is collected: DSIS, DSCQS, SSCQE, DSCS) 

4. Invite sufficient number and types of experts (18 or more is recommended) 

5. Carry out testing and calculate the mean expert opinion scores (MOS) for each 
test set-up 
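
The last step of the procedure above, computing the mean opinion score for each test set-up, can be sketched as follows. The scores are hypothetical, a five-grade scale is assumed, and the 95% confidence half-width uses the usual normal approximation 1.96·s/√N, which is an assumption of this sketch.

```python
import math
import statistics


def mean_opinion_score(scores):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    n = len(scores)
    mos = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(n)
    return mos, half_width


# Hypothetical votes collected on a five-grade scale for two test set-ups.
votes = {
    "set-up A": [5, 4, 4, 3, 4],
    "set-up B": [2, 3, 2, 1, 2],
}
for name, scores in votes.items():
    mos, ci = mean_opinion_score(scores)
    print(f"{name}: MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval together with the mean makes it visible when the number of viewers is too small to separate two set-ups.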


In order to establish meaningful subjective assessment results, some test methods, 
grading scales, and viewing conditions have been standardized by ITU-R Recommendation
BT.500-11 (2002) “Methodology for the subjective assessment of the
quality of television pictures.” Some of these test methods are double stimulus where
viewers rate the quality or change in quality between two video streams (reference
and impaired). Others are single stimulus where viewers rate the quality of just one 
video stream (the impaired). Examples of the former are the double stimulus impair- 
ment scale (DSIS), double stimulus continuous quality scale (DSCQS), and double 
stimulus comparison scale (DSCS) methods. An example of the latter is the single 
stimulus continuous quality evaluation (SSCQE) method. In the DSIS method,
observers are first presented with an unimpaired reference video, then the same video
impaired, and are asked to vote on the second video using an impairment
scale (from “impairments are imperceptible” to “impairments are very annoying”). 
In the DSCQS method, the sequences are again presented in pairs: the reference and 
impaired. However, observers are not told which one is the reference and are asked to 
assess the quality of both. In the series of tests, the position of the reference is changed 
randomly. Different test methodologies have claimed advantages for different cases. 


2.5.3 Objective Quality Assessment 


The goal of objective image quality assessment is to develop quantitative measures that 
can automatically predict perceived image quality [Bov 13]. Objective image/video 
quality metrics are mathematical models or equations whose results are expected to 
correlate well with subjective assessments. The goodness of an objective video-quality 
metric can be assessed by computing the correlation between the objective scores and 
the subjective test results. The most frequently used performance measures are the
Pearson linear correlation coefficient, Spearman rank-order correlation coefficient,
kurtosis, and the outliers ratio.
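
The Pearson and Spearman coefficients mentioned above can be computed directly from paired lists of objective scores and subjective results. The sketch below is a plain-Python illustration with made-up data; tied values receive the average of their ranks, which is the usual convention and is assumed here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up objective scores and subjective results for four test videos.
objective = [1.0, 2.0, 3.0, 4.0]
subjective = [1.0, 4.0, 9.0, 16.0]
print(round(pearson(objective, subjective), 4))
print(round(spearman(objective, subjective), 4))
```

In this example the relation is monotonic but not linear, so the rank-order coefficient reaches its maximum while the linear coefficient does not; this is the reason both are commonly reported.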

Objective metrics are classified as full reference (FR), reduced reference (RR), 
and no-reference (NR) metrics, based on availability of the original (high-quality) 
video, which is called the reference. FR metrics compute a function of the difference 
between every pixel in each frame of the test video and its corresponding pixel in the 
reference video. They cannot be used to evaluate the quality of the received video, 
since a reference video is not available at the receiver end. RR metrics extract some 
features of both videos and compare them to give a quality score. Only some features 
of the reference video must be sent along with the compressed video in order to 
evaluate the received video quality at the receiver end. NR metrics assess the quality 
of a test video without any reference to the original video. 


Objective Image/Video Quality Measures 


Perhaps the most well-established methodology for FR objective image and video 
quality evaluation is pixel-by-pixel comparison of image/video with the reference.
The peak signal-to-noise ratio (PSNR) measures the logarithm of the ratio of the
maximum signal power to the mean square error (MSE), given by

$$ \mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}} $$

where the MSE between the test video $\hat{s}[n_1,n_2,k]$, which is $N_1 \times N_2$ pixels and $N_3$
frames long, and the reference video $s[n_1,n_2,k]$ with the same size, can be computed by

$$ \mathrm{MSE} = \frac{1}{N_1 N_2 N_3} \sum_{n_1} \sum_{n_2} \sum_{k} \left( s[n_1,n_2,k] - \hat{s}[n_1,n_2,k] \right)^2 $$

Some have claimed that PSNR may not correlate well with the perceived visual
quality since it does not take into account many characteristics of the human visual 
system, such as spatial- and temporal-masking effects. To this effect, many alterna- 
tive FR metrics have been proposed. They can be classified as those based on struc- 
tural similarity and those based on human vision models. 
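
As a concrete illustration of the PSNR measure discussed above, the following sketch computes the MSE and PSNR between a test video and a reference video. It assumes 8-bit video (peak value 255) stored as nested lists indexed as video[frame][row][column]; the function names are illustrative.

```python
import math


def mse(test, reference):
    """Mean square error over all pixels of all frames."""
    total, count = 0.0, 0
    for frame_test, frame_ref in zip(test, reference):
        for row_test, row_ref in zip(frame_test, frame_ref):
            for a, b in zip(row_test, row_ref):
                total += (a - b) ** 2
                count += 1
    return total / count


def psnr(test, reference, peak=255):
    """PSNR in dB; infinite when the two videos are identical."""
    error = mse(test, reference)
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)


# One 2 x 2 frame; the test frame differs from the reference in two pixels.
reference = [[[100, 100], [100, 100]]]
test = [[[101, 99], [100, 100]]]
print(mse(test, reference))
print(round(psnr(test, reference), 2))
```

Since every pixel difference contributes equally regardless of where it occurs, the sketch also makes plain why PSNR cannot account for masking effects.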

The structural similarity index (SSIM) is a structural image similarity based FR
metric that aims to measure perceived change in structural information between two
$N \times N$ luminance blocks x and y, with means $\mu_x$ and $\mu_y$ and variances $\sigma_x^2$ and $\sigma_y^2$,
respectively. It is given by [Wan 04]

$$ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $$

where $\sigma_{xy}$ is the covariance between windows x and y and $c_1$ and $c_2$ are small constants
to avoid division by very small numbers.
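
The SSIM expression can be evaluated for a single pair of blocks as in the sketch below. It assumes 8-bit luminance blocks stored as nested lists, population (divide-by-N) estimates of the variances and covariance, and the commonly used constants c1 = (0.01·255)² and c2 = (0.03·255)²; these choices are assumptions of the sketch and are not stated in the text.

```python
def ssim_block(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM between two equal-size luminance blocks given as nested lists."""
    xs = [p for row in x for p in row]
    ys = [p for row in y for p in row]
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    var_x = sum((p - mu_x) ** 2 for p in xs) / n
    var_y = sum((q - mu_y) ** 2 for q in ys) / n
    cov_xy = sum((p - mu_x) * (q - mu_y) for p, q in zip(xs, ys)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


block = [[10, 50, 90], [30, 70, 110], [50, 90, 130]]
blurred = [[40, 60, 80], [50, 70, 90], [60, 80, 100]]   # lower-contrast version
print(round(ssim_block(block, block), 4))
print(round(ssim_block(block, blurred), 4))
```

In practice the index is computed over many overlapping windows and averaged; the single-block version here only shows how the terms of the formula fit together.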

Perceptual evaluation of video quality (PEVQ) is a vision-model-based FR met- 
ric that analyzes pictures pixel-by-pixel after a temporal alignment (registration) of 
corresponding frames of reference and test video. PEVQ aims to reflect how human 
viewers would evaluate video quality based on subjective comparison and outputs 
mean opinion scores (MOS) in the range from 1 (bad) to 5 (excellent). 

VQM is an RR metric that is based on a general model and associated calibration 
techniques and provides estimates of the overall impressions of subjective video qual- 
ity [Pin 04]. It combines perceptual effects of video artifacts including blur, noise, 
blockiness, color distortions, and motion jerkiness into a single metric. 

NR metrics can be used for monitoring quality of compressed images/video 
or video streaming over the Internet. Specific NR metrics have been developed for
quantifying such image artifacts as noise, blockiness, and ringing. However, the ability
of these metrics to make accurate quality predictions is usually satisfactory only
in a limited scope, such as for JPEG/JPEG2000 images.

The International Telecommunication Union (ITU) Video Quality Experts
Group (VQEG) standardized some of these metrics, including the PEVQ, SSIM, 
and VQM, as ITU-T Rec. J.246 (RR) and J.247 (FR) in 2008 and ITU-T Rec. 
J.341 (FR HD) in 2011. It is perhaps useful to distinguish the performance of these 
structural similarity and human vision model based metrics on still images and 
video. It is fair to say these metrics have so far been more successful on still images 
than video for objective quality assessment. 


Objective Quality Measures for Stereoscopic 3D Video 


FR evaluation of 3D image/video quality is technically not possible, since
the 3D signal is formed only in the brain. Hence, objective measures based on a stereo
pair or video-plus-depth-maps should be considered as RR metrics. It is generally
agreed upon that 3D quality of experience is related to at least three factors:


• Quality of display technology (cross-talk)
• Quality of content (visual discomfort due to accommodation-vergence conflict)
• Encoding/transmission distortions/artifacts


In addition to those artifacts discussed in Section 2.5.1, the main factors in 3D 
video quality of experience are visual discomfort and depth perception. As discussed 
in Section 2.1.4, visual discomfort is mainly due to the conflict between accom- 
modation and vergence and cross-talk between the left and right views. Human 
perception of distortions/artifacts in 3D stereo viewing is not fully understood yet. 
There have been some preliminary works on quantifying visual comfort and depth 
perception [Uka 08, Sha 13]. An overview of evaluation of stereo and multi-view 
image/video quality can be found in [Win 13]. There are also some studies evaluat- 
ing the perceptual quality of symmetrically and asymmetrically encoded stereoscopic 
videos [Sil 13].


CHAPTER 3
Image Filtering





An in-depth understanding of image-processing methods is essential for a rigorous 
study of digital-video processing. This chapter introduces essential image-filtering 
operations, such as Gaussian and bi-lateral filtering, gradient estimation, and image 
interpolation, that are required for motion estimation, as well as foundations of some 
ill-posed problems such as denoising, restoration, and super-resolution. 


Image filtering refers to processing of an input image to produce either a better- 
looking output image by contrast/sharpness and/or signal-to-noise ratio enhance- 
ment, or to compute some low-level image features such as edges, corners, or 
spatial-gradient values that may be used in subsequent image processing. This chapter
discusses the most common linear and nonlinear image-filtering operations, including
image smoothing, image re-sampling (decimation and interpolation), gradient
estimation, edge detection, image enhancement, image denoising, image deblurring 
(restoration), and image in-painting. 

A common practice in processing of color images is to first convert an RGB 
image into the luminance-chrominance (YCrCb) domain, process the luminance 
(Y) component only, and then convert back to RGB for display. This is because: 
i) processing R, G, and B components independently may alter the color balance, and 
ii) the human visual system is not very sensitive to high frequencies in chrominance 
components. Hence, this chapter considers processing of monochrome images only. 
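
The luminance-only processing practice described above can be sketched for a single pixel as follows. The conversion weights are those of ITU-R BT.601 in full-range, offset-free form; the text does not specify which conversion matrix is used, so this choice, like the function names, is an assumption.

```python
def rgb_to_ycrcb(r, g, b):
    """Convert one RGB pixel to luminance Y and chrominance Cr, Cb."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) / 1.402
    cb = (b - y) / 1.772
    return y, cr, cb


def ycrcb_to_rgb(y, cr, cb):
    """Exact inverse of rgb_to_ycrcb."""
    r = y + 1.402 * cr
    b = y + 1.772 * cb
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


def process_luminance(pixel, luma_op):
    """Apply luma_op to Y only and leave the chrominance untouched."""
    y, cr, cb = rgb_to_ycrcb(*pixel)
    return ycrcb_to_rgb(luma_op(y), cr, cb)


# Brighten a pixel by 10% in luminance; the chrominance is preserved.
brighter = process_luminance((200, 100, 50), lambda y: 1.1 * y)
print(tuple(round(v, 1) for v in brighter))
```

Since Cr and Cb pass through unchanged, a filter applied through `luma_op` cannot shift the color balance, which is the first of the two reasons given above.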


3.1 Image Smoothing


Image smoothing refers to removing high-frequency details, which yields a softer or 
somewhat blurry image. It is often employed as a pre-processing or an intermediate
processing step in many image-processing operations, including image decimation 
and interpolation, gradient estimation, and image enhancement. It is also used for 
image denoising. Indeed, the algorithms discussed in Section 3.5 are image-smoothing 
algorithms tailored to particular noise models. We can classify image-smoothing 
algorithms as linear shift-invariant (LSI) filters, and nonlinear/adaptive filters. 


3.1.1 Linear Shift-Invariant Low-Pass Filtering 


A low-pass filtered image $s_f(n_1,n_2)$ can be computed either in the discrete Fourier
transform (DFT) domain in terms of the filter-frequency response (see Chapter 1), or
in the spatial domain by 2D-convolution summation

$$ s_f(n_1,n_2) = \sum_{(i_1,i_2) \in W} h(i_1,i_2)\, s(n_1 - i_1, n_2 - i_2) \qquad (3.1a) $$

where $h(i_1,i_2)$ denotes the impulse response of the filter and $W$ is the filter support.
The impulse response must be normalized such that

$$ \sum_{(i_1,i_2) \in W} h(i_1,i_2) = 1 \qquad (3.1b) $$


so that the mean intensity of the filtered image remains unchanged. 
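
A direct implementation of the 2D-convolution summation with a normalized kernel might look like the sketch below. Images are nested lists indexed as image[n2][n1] (line, then column), and pixels outside the image are replaced by the nearest border pixel; the border rule is an assumption, since the text does not discuss boundaries.

```python
def normalize(kernel):
    """Scale the kernel so that its coefficients sum to 1."""
    total = sum(sum(row) for row in kernel)
    return [[c / total for c in row] for row in kernel]


def convolve2d(image, kernel):
    """Spatial-domain 2D convolution with border replication."""
    rows, cols = len(image), len(image[0])
    half_r, half_c = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for n2 in range(rows):
        for n1 in range(cols):
            acc = 0.0
            for i2 in range(-half_r, half_r + 1):
                for i1 in range(-half_c, half_c + 1):
                    r = min(max(n2 - i2, 0), rows - 1)   # clamp to the image
                    c = min(max(n1 - i1, 0), cols - 1)
                    acc += kernel[i2 + half_r][i1 + half_c] * image[r][c]
            out[n2][n1] = acc
    return out


box = normalize([[1, 1, 1], [1, 1, 1], [1, 1, 1]])
flat = [[10] * 4 for _ in range(4)]
print(convolve2d(flat, box)[0])   # a flat image keeps its mean intensity
```

Filtering a constant image is a quick check of the normalization condition: the output stays at the input level only if the coefficients sum to 1.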

Often separable or circularly symmetric filters (see Chapter 1) are preferred for 
ease of design and implementation. The most popular 2D linear shift-invariant 
smoothing filters are uniform (box) and Gaussian filters, whose impulse responses 
are depicted in Figure 3.1. Both filters are separable and can be derived from the 1D 
box filter 


$$ h(n) = [1 \;\; 1 \;\; 1 \;\; 1 \;\; 1] $$

and the 1D Gaussian filter

$$ h(n) = [1 \;\; 4 \;\; 7 \;\; 4 \;\; 1] $$

respectively. The frequency responses of these 1D filters are shown in Figure 3.2.

Figure 3.1 5 × 5 filter kernels with $N_1 = N_2 = 2$: (a) box filter and (b) Gaussian filter with $\sigma = 1$.
The coefficients need to be normalized so that their sum is equal to 1.

Figure 3.2 Frequency response of (a) 5 × 1 box filter and (b) Gaussian filter with $\sigma = 3$.


Box Filtering 


If $h(n_1,n_2)$ is uniform over a $(2N_1+1) \times (2N_2+1)$ rectangular support, it can be
implemented by a fast computational algorithm, called box filtering [McD 81]. The
implementation of box filtering requires two running buffers, a vertical sum buffer
(VSB) and a horizontal sum buffer (HSB), which are shown in Figure 3.3. The VSB
is updated once every line by removing one whole line (the topmost line) and adding
one new line. The HSB computes a running average over the current VSB. The
VSB is initialized as

$$ \mathrm{VSB}_{N_2}(n_1) = \sum_{n_2=0}^{2N_2} s(n_1,n_2) \qquad (3.2a) $$

and is updated for $n_2 = N_2, \ldots$ as

$$ \mathrm{VSB}_{n_2+1}(n_1) = \mathrm{VSB}_{n_2}(n_1) - s(n_1, n_2 - N_2) + s(n_1, n_2 + N_2 + 1) \qquad (3.2b) $$

Figure 3.3 Illustration of the VSB and HSB for the implementation of the box filter.


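The HSB for line $n_2$ is initialized as

$$ \mathrm{HSB}_{n_2}(N_1) = \sum_{n_1=0}^{2N_1} \mathrm{VSB}_{n_2}(n_1) \qquad (3.3a) $$

and is updated for $n_1 = N_1, \ldots$ as

$$ \mathrm{HSB}_{n_2}(n_1+1) = \mathrm{HSB}_{n_2}(n_1) - \mathrm{VSB}_{n_2}(n_1 - N_1) + \mathrm{VSB}_{n_2}(n_1 + N_1 + 1) \qquad (3.3b) $$

Then, the filtered output image is given by

$$ s_f(n_1,n_2) = \frac{1}{(2N_1+1)(2N_2+1)}\, \mathrm{HSB}_{n_2}(n_1) \qquad (3.4) $$
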
Hence, the computational complexity of box filtering is equal to one multiplica- 
tion (division), two additions, and two subtractions per pixel, independent of the 
size of the filter impulse response (kernel). 
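
The running-sum scheme of box filtering can be sketched as follows. The code keeps one VSB entry per column and a single HSB value per line, and produces output only for pixels whose full window lies inside the image; this treatment of the borders, the nested-list image layout image[n2][n1], and the direct reference implementation used for comparison are assumptions of the sketch.

```python
def box_filter(image, N1, N2):
    """Box filter over a (2*N1+1) x (2*N2+1) window using running sums."""
    rows, cols = len(image), len(image[0])
    area = (2 * N1 + 1) * (2 * N2 + 1)
    out = []
    # VSB initialization: column sums over lines 0 .. 2*N2.
    vsb = [sum(image[n2][n1] for n2 in range(2 * N2 + 1)) for n1 in range(cols)]
    for n2 in range(N2, rows - N2):
        # HSB initialization: sum of the first 2*N1+1 VSB entries.
        hsb = sum(vsb[n1] for n1 in range(2 * N1 + 1))
        line = []
        for n1 in range(N1, cols - N1):
            line.append(hsb / area)                        # one division per pixel
            if n1 + N1 + 1 < cols:
                hsb += vsb[n1 + N1 + 1] - vsb[n1 - N1]     # HSB update
        out.append(line)
        if n2 + N2 + 1 < rows:
            for n1 in range(cols):                         # VSB update for next line
                vsb[n1] += image[n2 + N2 + 1][n1] - image[n2 - N2][n1]
    return out


def box_filter_direct(image, N1, N2):
    """Reference version that sums the whole window for every pixel."""
    rows, cols = len(image), len(image[0])
    area = (2 * N1 + 1) * (2 * N2 + 1)
    return [[sum(image[r][c]
                 for r in range(n2 - N2, n2 + N2 + 1)
                 for c in range(n1 - N1, n1 + N1 + 1)) / area
             for n1 in range(N1, cols - N1)]
            for n2 in range(N2, rows - N2)]


image = [[(3 * r + 5 * c) % 11 for c in range(7)] for r in range(6)]
print(box_filter(image, 2, 1) == box_filter_direct(image, 2, 1))
```

Each output pixel costs one addition and one subtraction for the HSB, one addition and one subtraction for its VSB entry, and one division, whatever the window size; the direct version, by contrast, grows with the window area.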


Gaussian Filtering 


The Marr-Hildreth scale space theory employs a Gaussian kernel, with the scale
parameter $\sigma$, which can be implemented as a finite-impulse response filter over a
$(2N+1) \times (2N+1)$ support given by

$$ h(n_1,n_2) = K \exp\left( -\frac{n_1^2 + n_2^2}{2\sigma^2} \right), \quad -N \le n_1 \le N, \; -N \le n_2 \le N \qquad (3.5a) $$

where K is a normalization constant. The scale parameter $\sigma$ specifies the amount of
smoothing to be applied; the larger the $\sigma$, the more the smoothing. A typical value is
$\sigma = 2$. It can be seen that the Gaussian filter evaluates a weighted sum of input pixels
according to their distance from the center pixel. Note that $h(n_1,n_2)$ is separable, where

$$ h(n_1,n_2) = K \exp\left( -\frac{n_1^2 + n_2^2}{2\sigma^2} \right) = \left[ C \exp\left( -\frac{n_1^2}{2\sigma^2} \right) \right] \left[ C \exp\left( -\frac{n_2^2}{2\sigma^2} \right) \right] = h_1(n_1)\, h_2(n_2) \qquad (3.5b) $$


Hence, $h_1(n_1)$ and $h_2(n_2)$ can be applied to the rows and columns of the image,
respectively, for efficient implementation. In order to implement the filter using integer
arithmetic, the value of $C$ is set so that the smallest coefficient in $h_1(n_1)$ or $h_2(n_2)$
is equal to 1. All other coefficients are truncated to the nearest integer. The filter
output is normalized such that (3.1b) is satisfied.
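
As an illustration of the integer-arithmetic recipe above, the following sketch builds the 1D taps and applies them along rows and then columns. The function names, the border handling by pixel replication, and the reading of (3.1b) as "the coefficients sum to one" are assumptions made for this example.

```python
import math

def integer_gaussian_taps(sigma, n_half):
    """1D Gaussian taps scaled so the smallest tap is 1, then truncated to integers."""
    g = [math.exp(-(n * n) / (2.0 * sigma * sigma)) for n in range(-n_half, n_half + 1)]
    smallest = min(g)                       # dividing by it plays the role of C
    return [int(v / smallest) for v in g]   # int() truncates toward zero

def filter_1d(signal, taps):
    """Apply integer taps to a 1D signal and normalize by the tap sum."""
    half = len(taps) // 2
    norm = sum(taps)
    out = []
    for n in range(len(signal)):
        acc = 0
        for k, t in enumerate(taps):
            idx = min(max(n + k - half, 0), len(signal) - 1)  # replicate borders (assumption)
            acc += t * signal[idx]
        out.append(acc / norm)
    return out

def gaussian_filter(image, sigma, n_half):
    """Separable filtering: rows first, then columns."""
    taps = integer_gaussian_taps(sigma, n_half)
    rows = [filter_1d(row, taps) for row in image]
    cols = [filter_1d(list(col), taps) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

For $\sigma = 1$ and $N = 2$, the taps come out as 1, 4, 7, 4, 1, so each 1D pass divides by 17.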


3.1.2 Bi-Lateral Filtering 


LSI filters, such as the Gaussian filter, assume local similarity; i.e., pixel intensi- 
ties in nearby locations are similar to each other. However, this assumption breaks 
down near edges, where contrast and/or color of pixels change suddenly. As a result, 
modeling similarity by geometric proximity (in the domain space of an image) 
causes blurring and/or color bleeding artifacts. Several nonlinear/adaptive filtering 
approaches, including directional filtering, anisotropic diffusion, rank-order filter- 
ing, mean-shift (which converges to the local mode), and bi-lateral filtering, have 
been proposed to overcome this well-known problem. Among these filters, bi-lateral 
filtering, which can be considered as an adaptive extension of Gaussian filtering, has 
become ubiquitous in image-processing and computer graphics applications includ- 
ing multi-resolution image representations, tone mapping, denoising, and texture 
editing/relighting [Par 08]. A theoretical analysis and discussion of the relationship 
between bi-lateral filtering and other nonlinear/adaptive image-filtering frameworks 
can be found in [Bar 04, Par 08]. 

In bi-lateral filtering, similarity is modeled by both geometric proximity and 
photometric similarity (distance between pixel intensities, i.e., in the range space of 
an image). The name bi-lateral filtering originates from this combined domain and 
range-space filtering [Tom 98]. The basic idea of bi-lateral filtering (sometimes called 
sigma filtering or robust filtering) is to compute a weighted average of pixels $s(\mathbf{k})$ in
a local $(2N+1) \times (2N+1)$ neighborhood of the current pixel $\mathbf{n}$, which are within
some gray-level distance of the intensity of current (center) pixel $s(\mathbf{n})$, given by


$$g(\mathbf{n}) = \sum_{\mathbf{k}} w(\mathbf{n}, \mathbf{k})\, s(\mathbf{k}) \tag{3.6}$$




where $\mathbf{n} = (n_1, n_2)^T$, $\mathbf{k} = (k_1, k_2)^T$, and the weights $w(\mathbf{n}, \mathbf{k})$ depend on both spatial
proximity and photometric similarity between pixels $\mathbf{n}$ and $\mathbf{k}$.

In bi-lateral Gaussian filtering, both the spatial-closeness function p(-) and the 
intensity similarity function q(-) are Gaussian. More specifically, 


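$$p(\mathbf{n} - \mathbf{k}) = e^{-\frac{\|\mathbf{n} - \mathbf{k}\|^2}{2\sigma_d^2}} \tag{3.7a}$$


where $\|\cdot\|$ denotes the Euclidean distance of pixel $\mathbf{k}$ from the center pixel $\mathbf{n}$ of an
$N \times N$ kernel, and $\sigma_d^2$ determines the importance of spatial proximity, while


$$q(\mathbf{n} - \mathbf{k}) = e^{-\frac{d(s(\mathbf{n}),\, s(\mathbf{k}))}{2\sigma_r^2}} \tag{3.7b}$$


where $d(s(\mathbf{n}), s(\mathbf{k})) = \|s(\mathbf{n}) - s(\mathbf{k})\|^2$ is the distance between the intensity or color of
pixels $\mathbf{n}$ and $\mathbf{k}$ and $\sigma_r^2$ denotes the importance of intensity similarity. The coefficients
(weights) of the bi-lateral filter when centered at pixel $\mathbf{n}$ are given by


$$w(\mathbf{n}, \mathbf{k}) = K(\mathbf{n})\, p(\mathbf{n} - \mathbf{k})\, q(\mathbf{n} - \mathbf{k}) \tag{3.7c}$$


where $K(\mathbf{n})$ is a normalization constant so the weights for all pixels sum to one.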

In summary, each pixel is replaced by a weighted average of its neighbors where
the weights are computed as in (3.7c). There are three free parameters: the size of the
local neighborhood $N$, the scale parameter $\sigma_d^2$, and the range parameter $\sigma_r^2$. When
the range parameter is large, the bi-lateral filter approaches the Gaussian filter.
Otherwise, when the bi-lateral filter is centered on the bright side of an edge, the
similarity function $q(\cdot)$ takes values close to one for pixels on the same side, and close to zero
for pixels on the dark side of the edge. As a result, the filter replaces the center pixel
by an average of bright pixels in its vicinity. Conversely, when the filter is centered
on a dark pixel, a weighted average of only darker pixels is computed, which helps
preserve edges.
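
A direct (slow) evaluation of (3.6)-(3.7c) for a gray-level image can be sketched as below. The names `bilateral_filter`, `sigma_d`, and `sigma_r`, the factor of 2 in the exponents, and the decision to skip neighbors that fall outside the image are assumptions of this example; the fast methods mentioned in the text are not shown.

```python
import math

def bilateral_filter(image, n_half, sigma_d, sigma_r):
    """Brute-force bi-lateral filter over a (2*n_half+1) x (2*n_half+1) window."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            num = den = 0.0
            for dr in range(-n_half, n_half + 1):
                for dc in range(-n_half, n_half + 1):
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue                      # neighbor outside the image
                    p = math.exp(-(dr * dr + dc * dc) / (2 * sigma_d ** 2))  # spatial term
                    diff = image[rr][cc] - image[r][c]
                    q = math.exp(-(diff * diff) / (2 * sigma_r ** 2))        # range term
                    num += p * q * image[rr][cc]
                    den += p * q
            out[r][c] = num / den   # dividing by den plays the role of K(n)
    return out
```

On a step edge between 0 and 100, a small `sigma_r` (for example 5) leaves both sides essentially untouched, while a very large `sigma_r` reproduces ordinary Gaussian smoothing across the edge.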

Several fast computational methods, based on the idea of quantization along 
the intensity axis and down-sampling in the spatial domain, have been proposed for 
efficient implementation of the bi-lateral filter [Par 08]. 


3.2 Image Re-Sampling and Multi-Resolution 
Representations 


Image re-sampling, also known as decimation and interpolation, requires evalua- 
tion of image intensity at sub-pixel locations. It appears in many image-processing 
problems including image scaling, color de-mosaicking, multi-resolution representations,
sub-pixel motion estimation, motion-compensated filtering, image warping,
and synthetic view synthesis. The filters used for image decimation and interpolation
have a significant effect on the quality of the results. Most image re-sampling filters
are separable; hence, 1D filters are applied independently in the $n_1$ and $n_2$ directions.
This reduces image re-sampling to a 1D signal re-sampling problem.

In multi-rate digital signal processing, decimation and interpolation are used to 
match the sampling rate of a signal with the bandwidth requirements of a specific 
application [Cro 83]. Interpolation refers to the process of up-sampling followed 
by appropriate filtering, while decimation refers to appropriate filtering followed by 
down-sampling. In the following, we first discuss decimation and then interpola- 
tion by an integer factor. Sampling rate change by a rational factor and polyphase 
filtering for efficient implementation are also presented. We present multi-resolution 
(pyramid and wavelet) image representation as an application of image re-sampling. 


3.2.1 Image Decimation 


Decimation refers to down-sampling of a signal; hence, it technically can only be 
applied to over-sampled signals (in the sense of the Nyquist sampling rate) without 
loss of information. Otherwise, it either causes aliasing (if no anti-alias filter is applied) 
or blurring (if the proper anti-alias filtering is applied). Decimation by a factor of 
M can be modeled in two steps: first, multiplication by an impulse train to replace 
$M-1$ samples in between every $M$th sample with zeros, and then discarding the zero
samples to obtain a signal at the lower rate. We describe these steps in the following.
Given the input signal $s(n)$, define an intermediate signal $w(n)$ by


$$w(n) = s(n) \sum_{k=-\infty}^{\infty} \delta(n - kM) \tag{3.8}$$

Then, the signal decimated by a factor of M can be expressed as


y(n) = w(Mn) (3.9) 
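
The two steps in (3.8) and (3.9) map directly onto list operations. The sketch below is only illustrative; the function names are hypothetical and the signal is assumed to start at $n = 0$.

```python
def impulse_train_mask(s, M):
    """w(n) of (3.8): keep every M-th sample and zero the M-1 samples in between."""
    return [x if n % M == 0 else 0 for n, x in enumerate(s)]

def decimate(s, M):
    """y(n) = w(Mn) of (3.9): discard the zeroed samples."""
    w = impulse_train_mask(s, M)
    return [w[M * n] for n in range((len(s) + M - 1) // M)]
```

For example, `decimate([1, 2, 3, 4, 5, 6, 7], 2)` passes through the intermediate signal `[1, 0, 3, 0, 5, 0, 7]` and returns `[1, 3, 5, 7]`.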


The decimation process in the spatial domain without anti-alias filtering is illus- 
trated in Figure 3.4 for M = 2. The intermediate signal w(n) is introduced to facili- 
tate the frequency-domain characterization of decimation. To this effect, we first 
compute the Fourier transform of the intermediate signal w(n) using (3.8) as 


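$$W(e^{j\omega}) = \sum_{n=-\infty}^{\infty} \left\{ s(n) \sum_{k=-\infty}^{\infty} \delta(n - kM) \right\} e^{-j\omega n} = \frac{1}{M} \sum_{k=0}^{M-1} S\!\left(e^{j\left(\omega - \frac{2\pi k}{M}\right)}\right) \tag{3.10}$$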
Figure 3.4 Decimation by M = 2: (a) input signal; (b) intermediate signal;
and (c) down-sampled signal.


It can be seen that the spectrum of the intermediate signal has $M$ replications of
the input signal spectrum in the interval $(-\pi, \pi)$. The spectrum $Y(e^{j\omega})$ of the final
decimated signal is obtained by expansion of the frequency axis of $W(e^{j\omega})$ given by


$$Y(e^{j\omega}) = \sum_{n=-\infty}^{\infty} w(Mn)\, e^{-j\omega n} = W\!\left(e^{j\frac{\omega}{M}}\right) = \frac{1}{M} \sum_{k=0}^{M-1} S\!\left(e^{j\frac{\omega - 2\pi k}{M}}\right) \tag{3.11}$$


If the bandwidth of the input $S(e^{j\omega})$ is more than $\pi/M$, then the replications will
overlap and the decimated signal will suffer from aliasing. This is expected, because
the sampling rate of the decimated signal should not be allowed to fall below the
Nyquist rate. If an application mandates going below the Nyquist rate, then appropriate
anti-alias filtering should be applied prior to decimation. Ideal anti-alias filtering
requires an ideal low-pass filter, as shown in Figure 3.5. The cutoff frequency of
the ideal anti-alias filter is $\pi/M$ for decimation by $M$.

Common choices for realizable anti-alias filters are box, Gaussian (Section 3.1.1),
or bi-lateral filters (Section 3.1.2). Box filtering averages all pixels within a local
window. Although it is a poor approximation to the ideal low-pass filter, it is often
preferred because of computational simplicity. The Gaussian and bi-lateral filters are
employed in multi-resolution image representations.
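
A small worked example makes the aliasing-versus-blurring trade-off concrete. The signal $s(n) = (-1)^n$ has all its energy at $\omega = \pi$, above the $\pi/2$ limit for $M = 2$. The two helper functions below are hypothetical names chosen for this sketch, and the box pre-filter averages non-overlapping blocks of $M$ samples, which is one simple way (an assumption here) to combine box filtering with down-sampling.

```python
def decimate_plain(s, M):
    """Down-sample with no anti-alias filter."""
    return s[::M]

def decimate_box(s, M):
    """Average each block of M samples, then keep one value per block."""
    return [sum(s[i:i + M]) / M for i in range(0, len(s) - M + 1, M)]

alternating = [(-1) ** n for n in range(8)]   # highest-frequency input
aliased = decimate_plain(alternating, 2)      # looks like a constant (DC) signal
blurred = decimate_box(alternating, 2)        # the oscillation is averaged away
```

Without the pre-filter, the oscillation reappears as a constant signal of ones (aliasing to DC); with the box pre-filter, the result is all zeros (the detail is blurred out instead).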





Figure 3.5 Decimation by M = 2: (a) system diagram; (b) spectrum of the input and the anti-alias
filter; and (c) spectrum of the down-sampled output signal.


Efficient Polyphase Implementation of Decimation Filters


We apply anti-alias filtering to the input image before down-sampling (at the high
rate); however, $M-1$ out of every $M$ samples of the filtered image are discarded
during the down-sampling step. This redundancy may be avoided by skipping the
computation of samples that will be discarded in the standard serial implementation, 
but that is not very efficient since the arithmetic units would remain idle in M—1 
sampling periods out of every M. An efficient implementation can be obtained 
based on the following observations: i) each retained output sample is generated 
by a sub-set of the filter coefficients, and ii) the subset of filter coefficients for each 
output sample varies in a periodic pattern with period M. Polyphase implementa- 
tion exploits these facts by forming M parallel branches, one for each subset of filter 
coefficients, and the order of anti-alias filtering and down-sampling is interchanged 
at each branch. That is, each branch works at low (output) rate, and the outputs of 
the M parallel short filters are summed to form the desired decimated signal. Details 
of polyphase implementation of decimation filters can be found in [Mit 06]. 
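
The equivalence between "filter at the high rate, then down-sample" and the $M$-branch polyphase structure can be checked numerically. The sketch below assumes a causal FIR filter `h` and zero values outside the signal; both function names are hypothetical, and the branches are evaluated in a loop rather than in parallel.

```python
def fir_then_downsample(s, h, M):
    """Reference: full-rate causal FIR filtering, then keep every M-th output."""
    full = [sum(h[k] * s[n - k] for k in range(len(h)) if 0 <= n - k < len(s))
            for n in range(len(s))]
    return full[::M]

def polyphase_decimate(s, h, M):
    """M branches; branch m uses the sub-filter h[m::M] and runs at the output rate."""
    n_out = (len(s) + M - 1) // M
    y = [0.0] * n_out
    for m in range(M):
        branch = h[m::M]                    # coefficients h(kM + m)
        for n in range(n_out):
            for k, coeff in enumerate(branch):
                idx = (n - k) * M - m       # input sample s(nM - kM - m)
                if 0 <= idx < len(s):
                    y[n] += coeff * s[idx]
    return y
```

Both functions return the same samples, but the polyphase version never computes an output that would be discarded.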


3.2.2 Interpolation 


Interpolation refers to computing sub-pixel intensity values. Image-interpolation 
methods can be classified as LSI vs. adaptive/nonlinear filters. The LSI interpola- 
tion process can be analyzed in two steps: i) up-sampling by zero filling (also called 
“filling-in”), and ii) low-pass filtering of the zero-filled signal. We first characterize 
the frequency spectrum of the zero-filled signal. Given a signal $s(n)$, we define a
signal $u(n)$ that is upsampled (zero-filled) by $L$ as


$$u(n) = \begin{cases} s(n/L) & n = 0, \pm L, \pm 2L, \ldots \\ 0 & \text{otherwise} \end{cases} \tag{3.12}$$
The process of up-sampling by zero filling is demonstrated in Figure 3.6 for the
case $L = 3$. Next, we take the Fourier transform of $u(n)$ given by Eqn. (3.12). Using
the definition of the Fourier transform for discrete-time signals, we have


$$U(e^{j\omega}) = \sum_{n=-\infty}^{\infty} u(n)\, e^{-j\omega n} = \sum_{n=-\infty}^{\infty} s(n)\, e^{-j\omega L n} = S(e^{j\omega L}) \tag{3.13}$$


It can be seen from (3.13) that the spectrum of the zero-filled signal is related
to the spectrum of the input signal by a compression of the frequency axis. This is
illustrated in Figure 3.7 for the case of $L = 3$. Note that the spectrum of the input
$S(e^{j\omega})$ is assumed to occupy the full bandwidth, which is the interval $(-\pi, \pi)$ since
the Fourier transform of a discrete signal is periodic, with the period equal to $2\pi$
along the normalized frequency axis.
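
The relation in (3.13) can be verified numerically for a short finite-length signal. In this sketch the helper names `zero_fill` and `dtft` are hypothetical, and the signal is assumed to start at $n = 0$.

```python
import cmath

def zero_fill(s, L):
    """u(n) of (3.12): insert L-1 zeros after every input sample."""
    u = [0] * (len(s) * L)
    u[::L] = s
    return u

def dtft(x, omega):
    """Discrete-time Fourier transform of a finite sequence at frequency omega."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(x))

s = [1, 2, 3]
u = zero_fill(s, 3)                 # [1, 0, 0, 2, 0, 0, 3, 0, 0]
omega = 0.7
gap = abs(dtft(u, omega) - dtft(s, 3 * omega))   # U(e^{jw}) versus S(e^{jwL})
```

The difference `gap` is zero up to floating-point rounding for any `omega`, which is the frequency-axis compression stated above.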


Figure 3.6 Up-sampling by L = 3: (a) input signal and (b) zero-filled signal.

Figure 3.7 Spectra of signals: (a) input; (b) zero-filled L = 3; and (c) after ideal filtering.

Interpolation filtering aims at eliminating the replications caused by the zero-filling
process. This requires ideal low-pass filtering of the filled-in signal as shown
in Figure 3.7. The ideal low-pass filter would have a DC gain factor of $L$ and cutoff
frequency of $\omega_c = \pi/L$.

In the time domain, the filtering operation can be viewed as replacing the zero 
samples with non-zero values by means of a smoothing operation. The impulse 
response of the ideal interpolation filter is a sinc function given by 


$$h(n) = \frac{\sin(\pi n / L)}{\pi n / L} \tag{3.14}$$


Thus, the interpolated signal samples are given by 


$$y(n) = \sum_{k} u(k)\, \frac{\sin(\pi (n - k)/L)}{\pi (n - k)/L} \tag{3.15}$$


The impulse response of the ideal LSI interpolation filter has the properties
that $h(0) = 1$ and $h(n) = 0$ for $n = \pm L, \pm 2L, \ldots$. Because of these zero crossings,
$y(n) = s(n)$ at the existing sample values, while assigning non-zero values for the zero
samples in the upsampled signal. This sample preservation property is an important 
characteristic of all interpolation filters. 
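
The zero crossings of (3.14) and the resulting sample preservation can be checked with a truncated version of the sum in (3.15). The truncation to `half_width` taps on each side is an assumption made so the example is finite; it is an approximation of the ideal filter, not the ideal filter itself.

```python
import math

def ideal_kernel(n, L):
    """h(n) of (3.14), with h(0) = 1."""
    if n == 0:
        return 1.0
    x = math.pi * n / L
    return math.sin(x) / x

def sinc_interpolate(s, L, half_width):
    """Zero-fill by L, then apply (3.15) with the sum limited to |n - k| <= half_width."""
    u = [0.0] * (len(s) * L)
    u[::L] = s
    return [sum(u[k] * ideal_kernel(n - k, L)
                for k in range(max(0, n - half_width), min(len(u), n + half_width + 1)))
            for n in range(len(u))]
```

Because the kernel vanishes (up to rounding) at nonzero multiples of $L$, every $L$th output sample reproduces the corresponding input sample, while the samples in between receive interpolated values.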

The ideal interpolation filter is unrealizable since it is an infinite-impulse 
response, non-causal filter; i.e., its implementation requires infinite-time delay. 
Thus, it is approximated by one of the following realizable filters, which also possess 
the sample preservation property. 


Zero-Order Hold Filter 


Zero-order hold is the simplest interpolation method that corresponds to pixel repli- 
cation. The impulse response of the zero-order hold filter is given by 


$$h(n) = \begin{cases} 1 & \text{if } 0 \le n \le L-1 \\ 0 & \text{otherwise} \end{cases} \tag{3.16}$$


and its implementation is depicted in Figure 3.8. 
Figure 3.8 Interpolation filters (L = 3): (a) zero-order hold and (b) linear interpolation.


Note that the zero-order hold filter is a poor interpolation filter, since its frequency
response is given by a sinc function,


$$H(e^{j\omega}) = e^{-j\omega (L-1)/2}\, \frac{\sin(\omega L/2)}{\sin(\omega/2)} \tag{3.17}$$


which has large sidelobes. These sidelobes prevent effective removal of replications in
the frequency domain, which results in aliasing artifacts in the interpolated signal.
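
Zero-order hold interpolation, viewed as zero filling followed by convolution with the kernel of (3.16), can be sketched in a few lines. The function name is hypothetical.

```python
def zero_order_hold(s, L):
    """Zero-fill by L, then convolve with h(n) = 1 for 0 <= n <= L-1."""
    u = [0] * (len(s) * L)
    u[::L] = s
    h = [1] * L
    return [sum(h[k] * u[n - k] for k in range(L) if n - k >= 0)
            for n in range(len(u))]
```

For example, `zero_order_hold([3, 7], 3)` returns `[3, 3, 3, 7, 7, 7]`, which is plain pixel replication.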


Linear Interpolation 


Linear interpolation computes a weighted sum of two nearest neighbor pixels (one
on each side). The weights are inversely proportional to the distance of the pixel to be
interpolated from its neighbors, resulting in unequal weights if $L \ne 2$. The impulse
response of the linear interpolation filter is given by


$$h(n) = \begin{cases} \dfrac{L - |n|}{L} & \text{for } |n| \le L-1 \\ 0 & \text{otherwise} \end{cases} \tag{3.18}$$



which is shown in Figure 3.8 for $L = 3$. The frequency response of the linear
interpolation filter is equal to the square of the sinc function. Thus, it has lower
sidelobes than the zero-order hold filter. The filter (3.18) has the sample preserving
property.

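
The triangular kernel of (3.18) and its sample-preserving behavior can be illustrated as follows. The function names are hypothetical, and samples beyond the ends of the signal are assumed to be zero, so the last few outputs taper toward zero.

```python
def linear_kernel(L):
    """h(n) = (L - |n|) / L for |n| <= L - 1."""
    return [(L - abs(n)) / L for n in range(-(L - 1), L)]

def linear_interpolate(s, L):
    """Zero-fill by L, then convolve with the (non-causal) triangular kernel."""
    u = [0.0] * (len(s) * L)
    u[::L] = s
    h = linear_kernel(L)
    half = L - 1
    return [sum(h[k + half] * u[n - k]
                for k in range(-half, half + 1) if 0 <= n - k < len(u))
            for n in range(len(u))]
```

With `s = [0, 3, 6]` and `L = 3`, the first seven outputs are 0, 1, 2, 3, 4, 5, 6: the original samples appear unchanged at positions 0, 3, and 6, with straight-line values in between.
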
Cubic-Convolution Interpolation 


The cubic-convolution filter computes a weighted sum of four nearest neighbor 
pixels (two on each side). The weights are obtained by approximating the impulse 
response of the ideal low-pass filter in Eqn. (3.14) by a piecewise cubic polynomial 
consisting of three cubics. 
Figure 3.9 Impulse response of the cubic-convolution filter for L = 2.


A method for designing finite-impulse response (FIR) cubic-convolution filters
has been proposed by Keys [Key 81] (see Exercise 3.2). The impulse response of the
filter, which has $4L-1$ taps, is demonstrated in Figure 3.9 for $L = 2$. It is
straightforward to show that this filter also has the sample preserving property. We note that
the cubic convolution filter approximates the unrealizable ideal low-pass filter better 
than a truncated sinc filter of the same length, because the frequency response of the 
truncated sinc filter suffers from ringing due to the well-known Gibbs phenomenon 
[Mit 06]. 
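
For reference, the sketch below evaluates the cubic-convolution kernel commonly attributed to Keys, with the free parameter set to $a = -1/2$. The text does not state the kernel coefficients, so the polynomial form and the value of `a` used here are assumptions of this example.

```python
def keys_kernel(x, a=-0.5):
    """Piecewise-cubic kernel on |x| < 2 (continuous-argument form)."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x ** 3 - (a + 3) * x ** 2 + 1
    if x < 2:
        return a * x ** 3 - 5 * a * x ** 2 + 8 * a * x - 4 * a
    return 0.0

def cubic_taps(L, a=-0.5):
    """Sample the kernel at n / L for |n| <= 2L - 1, giving 4L - 1 taps."""
    return [keys_kernel(n / L, a) for n in range(-(2 * L - 1), 2 * L)]
```

For $L = 2$ this gives the seven taps $-0.0625,\ 0,\ 0.5625,\ 1,\ 0.5625,\ 0,\ -0.0625$. The center tap is 1 and the taps at $n = \pm L$ are 0, which is the sample preserving property, and the taps sum to $L = 2$.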


Efficient Polyphase Implementation of Interpolation 


The polyphase implementation is an efficient implementation that avoids multiplications
by zeros. The subsets of the impulse response coefficients that affect computation
of each of $L$ samples define the $L$ polyphase components of the interpolation filter
[Mit 06]. The input s(n) is fed into each of the polyphase filters without zero-filling (at 
the low rate) as depicted in Figure 3.10. The output samples from component filters 
are interleaved in sequential order to form the interpolated output (high rate). 
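
The following sketch compares direct interpolation (zero filling followed by full-rate filtering) with the polyphase form, in which each of the $L$ component filters works on the low-rate input. A causal FIR filter `h` is assumed, and both function names are hypothetical.

```python
def interpolate_direct(s, h, L):
    """Zero-fill by L, then filter at the high rate (multiplies many zeros)."""
    u = [0.0] * (len(s) * L)
    u[::L] = s
    return [sum(h[k] * u[n - k] for k in range(len(h)) if 0 <= n - k < len(u))
            for n in range(len(u))]

def interpolate_polyphase(s, h, L):
    """Branch m filters s(n) with h[m::L]; branch outputs are interleaved."""
    y = [0.0] * (len(s) * L)
    for m in range(L):
        branch = h[m::L]                    # coefficients h(kL + m)
        for n in range(len(s)):
            y[n * L + m] = sum(c * s[n - k] for k, c in enumerate(branch) if n - k >= 0)
    return y
```

Both functions produce the same high-rate output; the polyphase version simply never touches the inserted zeros.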


Sampling Rate Change by a Rational Factor 


The theory of decimation and interpolation (by an integer factor) easily extends to 
re-sampling by a rational factor L/M, by first interpolating by a factor L and then 
decimating the result by a factor of M. Since an interpolator and a decimator are 
cascaded, the interpolation (post-) filter and anti-alias (pre-) filter of decimation can 
be merged into a single low-pass filter with the cutoff frequency $\omega_c = \min\{\pi/M, \pi/L\}$,
which is depicted in Figure 3.11. The filter must satisfy the constraint that when 
location of the filter output sample matches an existing input sample location, the 
values of the existing samples must be preserved. If M > L, the system effectively 
performs decimation; otherwise, it performs interpolation. 

Figure 3.10 Polyphase implementation of the cubic-convolution filter in Figure 3.9.

Figure 3.11 Sampling rate change by a factor of L/M.


Re-sampling with a fractional factor $L/M$ means that $L$ output samples will fall
at non-integer locations between $M$ existing samples (pixels), whose values must be
preserved. The polyphase implementation of re-sampling with a fractional factor
$L/M$ requires $L$ component filters that are fed by input $s(n)$, similar to that shown
in Figure 3.10, whose outputs are sampled sequentially to be placed at these
non-integer locations. The polyphase implementation may result in significant
computational savings, especially for prime factors, where the numerator is large.
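
A minimal sketch of re-sampling by $L/M$ that computes only the retained output samples is given below. It uses the linear kernel of (3.18) as a stand-in for the merged low-pass filter, which is an assumption of this example and is adequate only when $L \ge M$ (for $M > L$ a proper anti-alias filter with cutoff $\pi/M$ would be needed). The function name is hypothetical.

```python
def resample_rational(s, L, M):
    """Output sample m sits at high-rate index m*M, i.e., input position m*M/L."""
    n_out = ((len(s) - 1) * L) // M + 1
    y = []
    for m in range(n_out):
        base, phase = divmod(m * M, L)      # input index and polyphase branch
        nxt = min(base + 1, len(s) - 1)
        # Two-tap linear kernel for this phase; phase 0 copies the input sample.
        y.append(((L - phase) * s[base] + phase * s[nxt]) / L)
    return y
```

With `s = [0, 3, 6, 9]`, `L = 3`, and `M = 2`, the result is `[0, 2, 4, 6, 8]`: outputs whose location coincides with an input sample (phase 0) keep that sample's value, as the constraint above requires.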


Adaptive/Nonlinear Interpolation and Single-Frame Super-Resolution 


LSI filters produce the desired number of output pixels but cannot generate higher
resolution, i.e., cannot create details that are not present in the original image
(frequencies beyond $\pi/L$ are zero in Figure 3.7). Interpolation methods that aim to
recover frequencies higher than $\pi/L$ from a single low-resolution image are called
single-frame super-resolution (SR) methods. This is an ill-posed problem; hence, the
solution must rely on some image model (see Appendix A). The solutions can be
classified as adaptive/nonlinear interpolation filters and single-frame SR methods.


Adaptive/Nonlinear Interpolation 


LSI filtering treats all pixels the same. In contrast, adaptive/nonlinear filters change 
orientation and/or filter kernel depending on the local image properties to ren- 
der edges and texture with higher fidelity and minimize interpolation artifacts 
where they are most apparent. Edge-adaptive filters consist of two steps [Wan 07]: 
i) detect the presence, orientation, and energy of an edge (see Section 3.3), and
ii) interpolate by adapting the shape and parameters of the filter according to edge 
orientation and energy. Li and Orchard [Li 01] proposed an edge-directed inter- 
polation algorithm for natural images. They use local covariance coefficients, esti- 
mated from the low-resolution image, to adapt the interpolation filter by exploiting 
the geometric duality between the low-resolution and high-resolution covariances 
so that the filter coefficients are tuned to match an arbitrarily oriented step edge. A 
hybrid approach of switching between bilinear interpolation and covariance-based 
adaptive interpolation is proposed to reduce the overall computational complex- 
ity. Alternatively, anisotropic Gaussian filtering, where the filter kernel is tuned 
according to local edge orientation similar to bi-lateral filtering, has been proposed 
[Han 13]. A data-adaptive regression kernels framework unifies steerable filters 
and bi-lateral filtering [Tak 07]. Among stochastic model-based methods, Schultz 
and Stevenson [Sch 94] modeled the high-resolution image by a discontinuity- 
preserving Huber-Markov random field model and computed its maximum a 
posteriori (MAP) estimate. 


Single-Frame SR 


Single-frame SR methods include machine-learning methods and sparse model- 
based filters. Learning methods aim to establish a mapping from low-resolution 
image patches to high-resolution patches either using a set of pre-computed example 
pairs (dictionary) or self-similarity. Then, each input low-resolution image patch is 
compared to a set of stored low-resolution patches, and the high-resolution patch 
corresponding to the nearest low-resolution patch satisfying neighborhood match- 
ing criteria (e.g., a Markov network model) is selected as the output. Examples of 
this approach include the hallucination method [Bak 02], example-based super- 
resolution [Fre 02], and the sparse regression method [Kim 10]. 

Sparse modeling (see Appendix A) has proven useful for super-resolution, where 
a low-resolution image is modeled by a blur matrix H and sub-sampling matrix L, as 


y=LHs+v (3.19) 


In the case of image interpolation, the blur matrix H is taken as the identity 
matrix. Hence, sparse modeling becomes less effective since the data consistency 
term fails to constrain the local image structure. To this effect, Dong et al. [Don 13a] 
proposed incorporating non-local image self-similarity into sparse image representa- 
tion by using a non-local auto-regressive model as the data fidelity term. 


Color De-Mosaicking


Color de-mosaicking, used in most color cameras today, is an important applica- 
tion of image interpolation. Color filter arrays (CFA) for color image capture were 
discussed in Section 2.2.2 in Chapter 2. De-mosaicking refers to interpolation of 
missing pixels in each color channel. Commonly used bilinear and bicubic filters 
often cause false colors when applied to each color channel independently due to 
aliasing errors. To this effect, specific methods that take the correlation between 
R, G, and B channels into account have been developed for CFA interpolation 
[Gun 05, Men 11]. The interchannel correlation is modeled by assuming that 
color ratios (or differences) for an object remain constant, which prevents abrupt 
changes in hue. In order to exploit this model, the G channel is interpolated first, 
since it has the most pixels, using an edge-adaptive filter. Then R/G and B/G images 
are formed for existing R and B pixels, which are then independently interpolated. 
Final R and B images are obtained by multiplying the interpolated ratio images by 
the G channel values. 
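
The constant-color-ratio idea can be illustrated on a single image row. This is a simplified 1D sketch, not a CFA interpolation algorithm from the cited references: it assumes the G row has already been fully interpolated and is nonzero, that R is known at even positions only, and that the ratio R/G is interpolated by simple averaging of its two neighbors. The function name is hypothetical.

```python
def interpolate_red_from_ratio(g, r_sparse):
    """g: full green row; r_sparse: red values at even indices, None elsewhere."""
    ratio = [r / gv if r is not None else None for r, gv in zip(r_sparse, g)]
    for i, v in enumerate(ratio):
        if v is None:                       # odd positions: average the neighbors
            left = ratio[i - 1]
            right = ratio[i + 1] if i + 1 < len(ratio) else left
            ratio[i] = 0.5 * (left + right)
    # Multiply the interpolated ratio image by the G channel values.
    return [q * gv for q, gv in zip(ratio, g)]

g_row = [10, 20, 40, 80, 40]
r_row = interpolate_red_from_ratio(g_row, [5, None, 20, None, 20])
```

Here the true red row is half the green row, and the ratio-based result recovers it as 5, 10, 20, 40, 20. Averaging the red samples directly would instead give 12.5 and 20 at the two missing positions, changing the hue at those pixels.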


3.2.3 Multi-Resolution Pyramid Representations 


In a multi-resolution pyramid representation, depicted in Figure 3.12, the original
$N_1 \times N_2$ image $s_0(n_1, n_2)$ appears at the bottom (level 0). This image is decimated
(low-pass filtered and sub-sampled) by a factor of two (2) in each direction. The
resulting $N_1/2 \times N_2/2$ image $s_1(n_1, n_2)$ appears at the first level of the pyramid. The
procedure can be repeated until a desired number of levels is reached. If a low-pass 
filter with a truncated Gaussian impulse response is used, the resulting pyramid is 
known as a Gaussian pyramid. 

The Gaussian pyramid is an overcomplete (redundant) representation since the
total number of pixels in the pyramid, $N_1 N_2 \left(1 + \frac{1}{4} + \frac{1}{16} + \cdots\right)$,
approaches 4/3 times the number of pixels in the image in the limit as the number
of levels goes to infinity. Since the frequency response of the Gaussian filter has some
leakage beyond the frequency $\omega = \pi/2$, images in the upper levels may contain aliasing.
In some vision applications, the sub-sampling step may be skipped such that all
images are the same size but successively more blurred to avoid aliasing artifacts in
lower resolution (upper-level) images.
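
A small pyramid builder makes the pixel-count argument concrete. The separable $[1, 2, 1]/4$ kernel used below is a short binomial approximation to a truncated Gaussian, chosen for brevity; it, the border replication, and the function names are assumptions of this sketch.

```python
def reduce_level(image):
    """Low-pass with the separable [1, 2, 1]/4 kernel, then sub-sample by 2."""
    def smooth(row):
        n = len(row)
        return [(row[max(i - 1, 0)] + 2 * row[i] + row[min(i + 1, n - 1)]) / 4
                for i in range(n)]
    rows = [smooth(r) for r in image]
    cols = [smooth(list(c)) for c in zip(*rows)]
    blurred = [list(r) for r in zip(*cols)]
    return [r[::2] for r in blurred[::2]]

def gaussian_pyramid(image, levels):
    """Level 0 is the original image; each further level halves both dimensions."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid
```

For an $8 \times 8$ image and three reductions, the levels hold $64 + 16 + 4 + 1 = 85$ pixels, a ratio of about 1.33 to the original, just under the 4/3 limit.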
Figure 3.12 Multi-resolution (multi-scale) pyramid representation.


3.2.4 Wavelet Representations 


Unlike the pyramid representation, the wavelet transform is a multi-resolution image
representation that is critically sampled, i.e., the number of samples in the wavelet 
transform is equal to that of the original image. It is preferred not only for image 
coding but also for other filtering applications, such as image denoising. There are 
many classes of wavelet transforms with different properties, such as compact sup- 
port (using FIR filters), orthogonality, symmetry, regularity, and degree of smooth- 
ness. Two of the most popular classes for image processing/compression are the 
orthogonal and bi-orthogonal classes of wavelet transforms. 

In order to understand discrete wavelet analysis, we consider the two-channel 
(binary) decomposition, where a 1D signal s(n) is split into two equal-size frequency 
bands, called the lower and upper frequency bands, as shown in Figure 3.13. 





Figure 3.13 Block diagram of sub-band decomposition and reconstruction filtering.

In wavelet analysis, the scaling function $\Phi$ and mother wavelet $\Psi$ can be associated
with a low-pass filter $H_L(f)$ and a high-pass filter $H_H(f)$, respectively. If we
let the normalized sampling frequency equal $\omega = 2\pi$ or $f = 1$, where $\omega = 2\pi f$,
then theoretically we need an ideal low-pass filter with the passband $f \in (0, 1/4)$ and an
ideal high-pass filter with the passband $f \in (1/4, 1/2)$ for binary decomposition. In
practice, we employ FIR filters, with impulse responses $h_L[n]$ and $h_H[n]$, subject to
some constraints that are discussed below. The outputs of these analysis filters are
sub-sampled by 2 to obtain the low-pass and high-pass subsignals, $y_L[n]$ and $y_H[n]$,
respectively. Note that, due to sub-sampling, the sum of the number of samples
in $y_L[n]$ and $y_H[n]$ is equal to the number of samples in $s(n)$; hence, the wavelet
transform is critically sampled. After processing or compression/decompression in
the wavelet domain, sub-signals $\hat{y}_L[n]$ and $\hat{y}_H[n]$ are upsampled by zero filling, then
filtered using filters $g_L[n]$ and $g_H[n]$, whose outputs are summed to reconstruct the
signal $\hat{s}(n)$. The filters $g_L[n]$ and $g_H[n]$ are called synthesis filters.

The analysis-synthesis filters should have the following desired properties: 


1. Perfect-reconstruction (PR): PR refers to designing filters $h_L[n]$, $h_H[n]$, $g_L[n]$,
and $g_H[n]$ such that forward transform (analysis filtering) followed by inverse
transform (synthesis filtering) gives $\hat{s}(n) = s(n)$, assuming that subsignals are not
altered in the wavelet domain, i.e., $\hat{y}_L[n] = y_L[n]$ and $\hat{y}_H[n] = y_H[n]$. In order to
design realizable filters, we have to allow an overlap between the passbands of
the low-pass and high-pass filters to avoid any frequency gaps. A pair of low-pass
and high-pass filters, whose frequency responses exhibit mirror symmetry
about $f = 1/4$, as depicted in Figure 3.14, is called a quadrature mirror filter
(QMF) pair. The fact that the frequency response of the low-pass filter $H_L(f)$
extends into the band $(1/4, 1/2)$ and vice versa causes unavoidable aliasing when
each sub-band signal is decimated by 2.





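Figure 3.14 Frequency response of 1D binary decomposition filters.


In order to achieve alias-free (PR) analysis-synthesis filtering, the filters
must be designed in such a way that the aliasing introduced by the analysis
filter is exactly canceled by the synthesis filter. In order to see how this can be
achieved, we express the Fourier transform of the subsignals $y_L[n]$ and $y_H[n]$,
after subsampling by 2, as


$$Y_L(f) = \frac{1}{2}\left[ H_L\!\left(\frac{f}{2}\right) S\!\left(\frac{f}{2}\right) + H_L\!\left(-\frac{f}{2}\right) S\!\left(-\frac{f}{2}\right) \right] \tag{3.20a}$$

$$Y_H(f) = \frac{1}{2}\left[ H_H\!\left(\frac{f}{2}\right) S\!\left(\frac{f}{2}\right) + H_H\!\left(-\frac{f}{2}\right) S\!\left(-\frac{f}{2}\right) \right] \tag{3.20b}$$


respectively. The reconstructed signal can be expressed as


$$\hat{S}(f) = G_L(f)\, \hat{Y}_L(2f) + G_H(f)\, \hat{Y}_H(2f) \tag{3.21}$$


Assuming $\hat{Y}_L(f) = Y_L(f)$ and $\hat{Y}_H(f) = Y_H(f)$, and substituting (3.20) into
(3.21),


$$\hat{S}(f) = \frac{1}{2}\left[ H_L(f) G_L(f) + H_H(f) G_H(f) \right] S(f) + \frac{1}{2}\left[ H_L(-f) G_L(f) + H_H(-f) G_H(f) \right] S(-f)$$


Alias-cancellation (PR) can be achieved provided the filters satisfy


$$H_L(f) G_L(f) + H_H(f) G_H(f) = 2 \tag{3.22a}$$


$$H_L(-f) G_L(f) + H_H(-f) G_H(f) = 0 \tag{3.22b}$$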


2. Symmetry: An important concern in image transforms is avoiding increasing
the number of samples. Since linear convolution increases the number of samples,
filtering is implemented by circular convolution, which yields the same
number of output samples as that of the input. We apply a symmetric boundary 
extension in order to avoid introducing unnecessary high-frequency energy due 
to artificial left-to-right and top-to-bottom intensity discontinuities. However, 
to preserve symmetry after filtering so the number of wavelet coefficients does 
not increase, the filters must also be symmetric. Note that odd- (even-) length 
symmetric FIR filters are zero- (linear) phase filters. 

3. Orthogonality: Orthogonal filters implement a projection of the input image
onto a set of orthogonal basis functions. With proper normalization, orthogo- 
nal transforms preserve energy and norm, as stated by Parseval’s theorem. 


It has been shown that the only filter that satisfies two-channel perfect recon- 
struction, symmetry, and orthogonality conditions is the trivial case of 2-tap Haar 
filter pair, $h_L[n] = \{1, 1\}$ and $h_H[n] = \{1, -1\}$, which does not have good frequency
selectivity or energy compaction properties. Hence, in order to design perfect recon- 
struction FIR filters (compactly supported wavelets) with good frequency selectivity 
or energy compaction (a larger number of vanishing moments), we need to give 
up either symmetry or orthogonality. It turns out we can design orthogonal FIR 
filters that are perfect reconstruction and nearly symmetric or that are symmetric 
and nearly perfect reconstruction, or bi-orthogonal (non-orthogonal) symmetric FIR 
filters that enable perfect reconstruction. In the following, we briefly discuss the 
classes of orthogonal and bi-orthogonal filters used in wavelet image processing and 
compression in more detail. 
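
The two-channel structure is easiest to see with the Haar pair mentioned above. The sketch below uses the unnormalized filters $\{1, 1\}$ and $\{1, -1\}$, so the synthesis step divides by 2; that normalization, the even-length input, and the function names are assumptions of this example.

```python
def haar_analysis(s):
    """Filter with {1, 1} and {1, -1}, keeping one output per pair of inputs."""
    y_low = [s[i] + s[i + 1] for i in range(0, len(s), 2)]
    y_high = [s[i] - s[i + 1] for i in range(0, len(s), 2)]
    return y_low, y_high

def haar_synthesis(y_low, y_high):
    """Invert the analysis step; the factor 1/2 undoes the unnormalized sums."""
    s = []
    for lo, hi in zip(y_low, y_high):
        s += [(lo + hi) / 2, (lo - hi) / 2]
    return s

signal = [4, 6, 10, 2, 7, 7, 1, 9]
low, high = haar_analysis(signal)
restored = haar_synthesis(low, high)
```

The two subsignals together contain exactly as many samples as the input (critical sampling), and `restored` equals `signal` (perfect reconstruction).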


Orthogonal Filters 


Early sub-band coding methods have employed orthogonal, symmetric FIR (lin- 
ear phase) quadrature mirror filters (QMF) that are nearly perfect reconstruction. 
Orthogonal, FIR perfect reconstruction filters, such as Symlet or Coiflet families 
that are nearly symmetric [Dau 92], are preferred for image denoising, since orthog- 
onal transform of white Gaussian noise (image domain) is again white Gaussian in 
the wavelet domain. 

In orthogonal QMF design, perfect reconstruction can be achieved by canceling 
the aliased spectra, given by (3.22b), with the following simple choice of filters: 


$$H_H(f) = H_L(-f) = H_L\!\left(f + \tfrac{1}{2}\right), \qquad G_L(f) = 2 H_L(f), \qquad G_H(f) = -2 H_H(f) \tag{3.23}$$

Note that the first condition is equivalent to

$$h_H[n] = (-1)^n h_L[n]$$


Hence, orthogonal FIR analysis and synthesis filters have the same length. Substituting
(3.23) into (3.21), the reconstructed signal becomes

$$\hat{S}(f) = \left[ H_L^2(f) - H_H^2(f) \right] S(f)$$

For perfect reconstruction, it follows that orthogonal filters must also satisfy

$$\left| H_L^2(f) - H_H^2(f) \right| = 1 \quad \text{for all } f \tag{3.24}$$


Johnston filters allow for some small amplitude distortion in (3.24), while Sym- 
lets and Coiflets are near symmetric PR filters designed to satisfy some vanishing 
moments and regularity constraints. A complete derivation of these filters is beyond 
the scope of this book, and interested readers are referred to [Dau 92] for details. 
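
As a numerical sanity check of condition (3.24), the sketch below evaluates the frequency responses of the 2-tap averaging pair $h_L[n] = \{1/2, 1/2\}$ and $h_H[n] = (-1)^n h_L[n]$. The scaling by 1/2 (which pairs with a synthesis gain of 2) and the function names are assumptions of this example; longer QMF designs such as Johnston filters satisfy the condition only approximately.

```python
import cmath

def freq_response(h, f):
    """H(f) for a causal FIR filter, with f in cycles per sample."""
    return sum(c * cmath.exp(-2j * cmath.pi * f * n) for n, c in enumerate(h))

h_low = [0.5, 0.5]
h_high = [(-1) ** n * c for n, c in enumerate(h_low)]   # h_H[n] = (-1)^n h_L[n]

def pr_magnitude(f):
    """|H_L(f)^2 - H_H(f)^2|, which should equal 1 for perfect reconstruction."""
    return abs(freq_response(h_low, f) ** 2 - freq_response(h_high, f) ** 2)
```

For this pair, $H_L^2(f) - H_H^2(f)$ reduces to a pure one-sample delay, so the magnitude is 1 at every frequency. The modulation $(-1)^n$ also shifts the response by half the sampling rate, so `freq_response(h_high, f)` matches `freq_response(h_low, f + 0.5)`.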


Bi-Orthogonal Filters 


In image compression, orthogonality decorrelates the transform coefficients to 
minimize redundancy and ensures the energy of the quantization error commit- 
ted by quantization of the transform coefficients remains unchanged in the pixel 
domain. However, symmetry turns out to be more important than orthogonality, 
since symmetry is crucial for properly handling image borders without increasing 
the number of transform samples. Hence, bi-orthogonal filters have become the de 
facto choice for wavelet image compression. The bi-orthogonality constraints enable 
designing perfect reconstruction, symmetric (linear phase) FIR filters by relaxing the 
strict orthogonality requirement. The perfect reconstruction conditions (3.22a) and 
(3.22b) can also be stated as 


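$$H_L(f) G_L(f) + H_L(-f) G_L(-f) = 2 \tag{3.25a}$$
$$H_H(f) G_H(f) + H_H(-f) G_H(-f) = 2 \tag{3.25b}$$
$$H_H(f) G_L(f) + H_H(-f) G_L(-f) = 0 \tag{3.25c}$$
$$H_L(f) G_H(f) + H_L(-f) G_H(-f) = 0 \tag{3.25d}$$


which can be expressed in the spatial domain as bi-orthogonality constraints [Dau 92]


$$\langle h_L[k],\, g_L[2n - k] \rangle = \delta[n] \tag{3.26a}$$
$$\langle h_H[k],\, g_H[2n - k] \rangle = \delta[n] \tag{3.26b}$$
$$\langle g_L[k],\, h_H[2n - k] \rangle = 0 \tag{3.26c}$$
$$\langle g_H[k],\, h_L[2n - k] \rangle = 0 \tag{3.26d}$$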
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where (-,-) denotes the inner product. In wavelet terms, the filters {4,[7], 4, [7]} are 
derived from a pair {®,, Y}, while the filters {g,[], g, []} are derived from another 
pair {®,, W} which are related to {®,, W} by the bi-orthogonality constraints. The 
bi-orthogonality conditions provide us with more flexibility to design odd-length 
symmetric FIR filters. 

There is a large family of bi-orthogonal wavelets. Among these, (9,7) filters are 
nearly orthogonal and provide good energy compaction. The (9,7) and (5,3) filters 
have been selected for use in the JPEG2000 standard. The bi-orthogonal wavelet 
filters possess some regularity properties that orthogonal filters do not have, which 
provides bi-orthogonal filters with improved coding efficiency over orthogonal filters 
with the same number of taps (see Chapter 7). 

The 1D decompositions can be extended to two dimensions by using separable
filters, i.e., decomposing the image $s(n_1,n_2)$ first in the row and then in the
column direction, or vice versa. Using a binary decomposition in each direction, we
obtain four sub-bands called low (L) $y_L(n_1,n_2)$, horizontal (H) $y_H(n_1,n_2)$,
vertical (V) $y_V(n_1,n_2)$, and diagonal (D) $y_D(n_1,n_2)$, corresponding to
lower-lower, upper-lower, lower-upper, and upper-upper sub-bands, respectively. The
decomposition can be continued by splitting all sub-bands or just the L sub-band
repetitively, as shown in Figure 3.15. Other decompositions are also possible.





Figure 3.15 Two-level binary-tree decomposition of an image into frequency bands. 




The wavelet transform coefficients then correspond to pixels of the respective
sub-images. In most cases, the decomposition is carried out in multiple stages. The total
number of samples in all sub-images $y_L(n_1,n_2)$, $y_H(n_1,n_2)$, $y_V(n_1,n_2)$, and
$y_D(n_1,n_2)$ after subsampling is the same as the number of samples in the input image
$s(n_1,n_2)$. Thus, the wavelet decomposition itself does not result in data compression
or expansion. Observe that $y_L(n_1,n_2)$ corresponds to a low-resolution version of the
image $s(n_1,n_2)$, while $y_D(n_1,n_2)$ contains the high-frequency detail information.
Therefore, the wavelet decomposition is also known as a “multi-scale” or
“multi-resolution” representation, and can be used in progressive transmission.
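
The separable row-then-column decomposition described above can be sketched as follows. This is a minimal illustration only: it assumes the orthonormal Haar filter pair (the simplest orthogonal choice, not the (9,7) or (5,3) filters mentioned earlier), an image with even dimensions stored as a list of rows, and hypothetical function names. The four outputs are labeled generically by their row/column pass bands rather than as L, H, V, and D.

```python
def haar_split_1d(x):
    """Split an even-length sequence into low-pass and high-pass halves
    using the orthonormal Haar pair, followed by subsampling by 2."""
    r = 2 ** 0.5
    low = [(x[2 * i] + x[2 * i + 1]) / r for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / r for i in range(len(x) // 2)]
    return low, high


def transpose(m):
    return [list(col) for col in zip(*m)]


def split_columns(m):
    """Apply the 1D split along every column of m."""
    lows, highs = [], []
    for col in transpose(m):
        lo, hi = haar_split_1d(col)
        lows.append(lo)
        highs.append(hi)
    return transpose(lows), transpose(highs)


def haar_decompose_2d(image):
    """One-level separable decomposition: filter along rows, then columns."""
    row_low, row_high = [], []
    for row in image:
        lo, hi = haar_split_1d(row)
        row_low.append(lo)
        row_high.append(hi)
    y_ll, y_lh = split_columns(row_low)    # low along rows; low/high along columns
    y_hl, y_hh = split_columns(row_high)   # high along rows; low/high along columns
    return y_ll, y_hl, y_lh, y_hh
```

For an image with even dimensions, the four sub-images together hold exactly as many samples as the input, and with an orthonormal filter pair the total energy is preserved as well.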


3.3 Image-Gradient Estimation, Edge and 
Feature Detection 


In order to understand how image edges are modeled, we first look at a continuous 1D
signal $s(x)$, where an ideal edge can be represented by a unit-step function. However, in
most real-life signals, the transition from low to high or from high to low intensity
is gradual, and the edge location can be defined as the point where the first derivative
$s'(x)$ has an extremum, or equivalently, where the second derivative $s''(x)$ is zero, as
shown in Figure 3.16.

Images are two-dimensional; hence, we need to replace the concepts of first and 
second derivative with the gradient vector and Laplacian, respectively. The gradient 
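vector for a continuous image $s_c(x_1,x_2)$ is defined as

$$ \nabla s_c(x_1,x_2) = \begin{bmatrix} \dfrac{\partial s_c(x_1,x_2)}{\partial x_1} \\[2mm] \dfrac{\partial s_c(x_1,x_2)}{\partial x_2} \end{bmatrix} \tag{3.27} $$

where $\partial s_c / \partial x_i$ denotes the partial derivatives. The magnitude of the
gradient is given by

$$ \left| \nabla s_c(x_1,x_2) \right| = \sqrt{ \left( \frac{\partial s_c(x_1,x_2)}{\partial x_1} \right)^2 + \left( \frac{\partial s_c(x_1,x_2)}{\partial x_2} \right)^2 } \tag{3.28} $$

The Laplacian of a continuous image is the divergence of its gradient, i.e., the dot
product of the gradient operator with the gradient,

$$ \nabla^2 s_c(x_1,x_2) = \nabla \cdot \nabla s_c(x_1,x_2) = \frac{\partial^2 s_c(x_1,x_2)}{\partial x_1^2} + \frac{\partial^2 s_c(x_1,x_2)}{\partial x_2^2} \tag{3.29} $$

which is a scalar. The Laplacian is isotropic, favoring no particular edge direction.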


Figure 3.16 Illustration of an edge for a 1D continuous signal and its first and second derivatives: 
(a) increasing edge and (b) decreasing edge. 


Next, we need to estimate the first and second partials from discrete images,
which are discussed in Sections 3.3.1 and 3.3.2, respectively. Because the derivative
operators are high-pass filters, edge detection is sensitive to noise. Two types
of errors may be observed: i) False positives: Noise may generate many small peaks
in the magnitude of the gradient, resulting in false edges. ii) False negatives: Noise
may shift true edge locations, so that actual edge pixels are missed.
A popular edge-detection algorithm that addresses these problems to find meaningful
edges is introduced in Section 3.3.3.


3.3.1 Estimation of the Image Gradient 


We first discuss approximating partial derivatives by finite differences and provide 
some popular so-called “edge-detection operators” based on these approximations, 
and then present a method for estimating partials by the derivative of Gaussian filter. 


Estimation of Partials by Finite-Difference Operators 


Given a discrete image $s(n_1,n_2)$, the horizontal and vertical partial derivatives can be
approximated by respective finite differences. The horizontal partial can be estimated
by the horizontal forward difference

$$ \frac{\partial s_c(x_1,x_2)}{\partial x_1} \approx s[n_1+1,n_2] - s[n_1,n_2] \tag{3.30a} $$

or the horizontal backward difference

$$ \frac{\partial s_c(x_1,x_2)}{\partial x_1} \approx s[n_1,n_2] - s[n_1-1,n_2] \tag{3.30b} $$

Since we do not know which one will be a better estimate, we can compute the
average of the forward and backward differences, called the central difference,
as a more robust estimate

$$ \frac{\partial s_c(x_1,x_2)}{\partial x_1} \approx \frac{1}{2} \left( s[n_1+1,n_2] - s[n_1-1,n_2] \right) \tag{3.31} $$

Finite differences are sensitive to noise. In order to alleviate the effects of
observation noise, we can compute a local average (at the same horizontal sample $n_1$
over the current line $n_2$, the line before, and the line after) of the central
differences, called the average central difference

$$ \frac{\partial s_c(x_1,x_2)}{\partial x_1} \approx \frac{1}{6} \big\{ (s[n_1+1,n_2] - s[n_1-1,n_2]) + (s[n_1+1,n_2+1] - s[n_1-1,n_2+1]) + (s[n_1+1,n_2-1] - s[n_1-1,n_2-1]) \big\} \tag{3.32} $$

Other averaging strategies also exist for estimating partials using finite differences.
Estimation of the partials in the vertical direction can be treated in a similar
way.
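
As a small illustration, the four horizontal difference estimates can be written directly from their definitions. The sketch below assumes the image is stored so that `s[n1][n2]` mirrors the notation $s[n_1,n_2]$ and that the pixel is not on the image border; the function names are hypothetical.

```python
def forward_diff(s, n1, n2):
    # Horizontal forward difference, as in (3.30a)
    return s[n1 + 1][n2] - s[n1][n2]


def backward_diff(s, n1, n2):
    # Horizontal backward difference, as in (3.30b)
    return s[n1][n2] - s[n1 - 1][n2]


def central_diff(s, n1, n2):
    # Average of the forward and backward differences, as in (3.31)
    return 0.5 * (s[n1 + 1][n2] - s[n1 - 1][n2])


def avg_central_diff(s, n1, n2):
    # Central differences averaged over lines n2 - 1, n2, and n2 + 1
    total = sum(s[n1 + 1][n2 + d] - s[n1 - 1][n2 + d] for d in (-1, 0, 1))
    return total / 6.0
```

For the test image $s[n_1,n_2] = n_1^2$, the forward and backward differences at $n_1 = 2$ straddle the true derivative $2 n_1 = 4$, while both central estimates recover it exactly.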
Hence, the gradient of a discrete image can be expressed as

$$ \nabla s_c(x_1,x_2) \approx \nabla s[n_1,n_2] = \begin{bmatrix} s_{x_1}(n_1,n_2) \\ s_{x_2}(n_1,n_2) \end{bmatrix} = \begin{bmatrix} h_1(n_1,n_2) \ast\ast\, s(n_1,n_2) \\ h_2(n_1,n_2) \ast\ast\, s(n_1,n_2) \end{bmatrix} \tag{3.33} $$

where the computation of $s_{x_1}(n_1,n_2)$ and $s_{x_2}(n_1,n_2)$ can be interpreted as
2D-convolution operations, i.e., FIR filtering of an image $s(n_1,n_2)$ with finite impulse
responses $h_1(n_1,n_2)$ and $h_2(n_1,n_2)$, respectively, or as 2D-correlation operations.
Since 2D convolution requires flipping the impulse response with respect to both axes,
correlation operator kernels are 2D flipped versions of the impulse responses.

Since the gradient is a vector, the magnitude and the direction of the gradient
vector at a pixel $(n_1,n_2)$, which indicate the strength and the direction of a possible
edge at the pixel $(n_1,n_2)$, respectively, are given by

$$ \left| \nabla s[n_1,n_2] \right| = \sqrt{ \left( s_{x_1}(n_1,n_2) \right)^2 + \left( s_{x_2}(n_1,n_2) \right)^2 } \tag{3.34a} $$

$$ \angle \left( \nabla s[n_1,n_2] \right) = \tan^{-1} \left( \frac{s_{x_2}(n_1,n_2)}{s_{x_1}(n_1,n_2)} \right) \tag{3.34b} $$

In the following, we introduce some commonly used operators for gradient estimation,
which are also known as edge-detection operators. The FIR filters $h_1(n_1,n_2)$
and $h_2(n_1,n_2)$ are given by 2D flipped versions of these operators.

The Prewitt operator represents the average central difference approximation
given by (3.32). The Prewitt kernels for estimation of partials in the $x_1$ and $x_2$
directions are shown in Figure 3.17(a) and (b). The more popular horizontal and vertical
Sobel operators are shown in Figure 3.17(c) and (d). The difference between Prewitt
and Sobel operators is that the latter applies twice the weight to the center row
or column. Isotropic filters do not favor any particular edge direction. Prewitt and
Sobel filters respond to diagonal edges differently than to horizontal and vertical
edges because their coefficients do not take into account larger inter-pixel distances
in the diagonal directions. The Prewitt filter is less sensitive to diagonal edges than
to horizontal and vertical ones, while the opposite is true for the Sobel filter. The
Roberts cross operators, shown in Figure 3.17(e) and (f), aim to approximate the
gradient of an image by computing the sum of squares of the differences between
diagonally adjacent pixels.

    -1   0   1          1   1   1
    -1   0   1          0   0   0
    -1   0   1         -1  -1  -1
        (a)                (b)

    -1   0   1          1   2   1
    -2   0   2          0   0   0
    -1   0   1         -1  -2  -1
        (c)                (d)

     1   0              0   1
     0  -1             -1   0
       (e)               (f)

Figure 3.17 Operators to compute image partials, also called edge-detection operators:
(a) horizontal Prewitt operator; (b) vertical Prewitt operator; (c) horizontal Sobel operator;
(d) vertical Sobel operator; (e) and (f) Roberts cross operators.
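
A minimal sketch of gradient estimation with the Sobel operators is given below. It applies the kernels by correlation (no flipping), uses `image[row][col]` indexing with the column index as the horizontal direction, and leaves the result unnormalized; these conventions, the function names, and the use of `atan2` instead of a plain inverse tangent (to keep the quadrant) are assumptions of this sketch.

```python
import math

SOBEL_X1 = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal partial
SOBEL_X2 = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]    # vertical partial


def correlate3x3(image, kernel, r, c):
    """Correlate a 3x3 operator with the neighborhood centered at (r, c)."""
    return sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))


def sobel_gradient(image, r, c):
    """Return (magnitude, direction in degrees) of the gradient estimate."""
    gx = correlate3x3(image, SOBEL_X1, r, c)
    gy = correlate3x3(image, SOBEL_X2, r, c)
    return math.hypot(gx, gy), math.degrees(math.atan2(gy, gx))
```

On a vertical step edge, only the horizontal operator responds, so the estimated gradient points along the horizontal axis.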


Estimation of Partials by Derivatives of Gaussian Filtering 


Spatial presmoothing of an image with a Gaussian filter usually helps with gradient
estimation in the presence of noise. The estimation of partial derivatives from
Gaussian-smoothed images can be modeled as

$$ s_{x_1}(n_1,n_2) = h_1(n_1,n_2) \ast\ast \left( g(n_1,n_2) \ast\ast\, s(n_1,n_2) \right) \tag{3.35a} $$

where $g(n_1,n_2)$ is the Gaussian smoothing filter (see Section 3.1.2) and $h_1(n_1,n_2)$ is
a finite-difference operator to compute the partial in the horizontal direction. Since
2D convolution is associative, we can rewrite (3.35a) as

$$ s_{x_1}(n_1,n_2) = \left( h_1(n_1,n_2) \ast\ast\, g(n_1,n_2) \right) \ast\ast\, s(n_1,n_2) \tag{3.35b} $$

The combined filter is called the derivative of Gaussian filter, given by

$$ h_1^G(n_1,n_2) = h_1(n_1,n_2) \ast\ast\, g(n_1,n_2) \tag{3.36} $$

In order to evaluate the impulse response of the derivative of Gaussian filter, we
take a continuous 2D Gaussian with scale parameter $\sigma$

$$ g_c(x_1,x_2) = K e^{-\frac{x_1^2 + x_2^2}{2\sigma^2}} $$

and compute its partials in the horizontal and vertical directions as

$$ h_1^G(x_1,x_2) = \frac{\partial g_c(x_1,x_2)}{\partial x_1} = -K \frac{x_1}{\sigma^2} e^{-\frac{x_1^2 + x_2^2}{2\sigma^2}} \tag{3.37a} $$

$$ h_2^G(x_1,x_2) = \frac{\partial g_c(x_1,x_2)}{\partial x_2} = -K \frac{x_2}{\sigma^2} e^{-\frac{x_1^2 + x_2^2}{2\sigma^2}} \tag{3.37b} $$

Implementation of the derivative of Gaussian filtering requires sampling $h_1^G(x_1,x_2)$
and $h_2^G(x_1,x_2)$ over a finite support for a given value of $\sigma$ (see Exercise 3.1).

We note that the derivative of Gaussian filter is separable; hence, both the
horizontal and vertical partial filters can be efficiently implemented as a cascade of
two 1D filters each, where

$$ h_1^G(n_1,n_2) = \left[ -K \frac{n_1}{\sigma^2} e^{-\frac{n_1^2}{2\sigma^2}} \right] \left[ e^{-\frac{n_2^2}{2\sigma^2}} \right] \tag{3.38a} $$

and

$$ h_2^G(n_1,n_2) = \left[ e^{-\frac{n_1^2}{2\sigma^2}} \right] \left[ -K \frac{n_2}{\sigma^2} e^{-\frac{n_2^2}{2\sigma^2}} \right] \tag{3.38b} $$

Computation of image gradients is often implemented over a multi-resolution
Gaussian pyramid, shown in Figure 3.12, in top-to-bottom fashion. Edges at the top
(coarser) levels of the pyramid correspond to major edges, while other edges at lower
(finer) levels are regarded as finer details.
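
The separability noted above suggests building the sampled horizontal derivative of Gaussian filter from two 1D sequences. The sketch below makes several assumptions: the scale factor is set to $K = 1$ (no normalization), the support is truncated to a hypothetical `radius` in samples, and the kernel is stored as `kernel[n2 + radius][n1 + radius]`.

```python
import math


def gaussian_1d(sigma, radius):
    """Samples of exp(-n^2 / (2 sigma^2)) for n = -radius..radius."""
    return [math.exp(-(n * n) / (2.0 * sigma * sigma))
            for n in range(-radius, radius + 1)]


def gaussian_derivative_1d(sigma, radius):
    """Samples of -(n / sigma^2) exp(-n^2 / (2 sigma^2))."""
    return [-(n / (sigma * sigma)) * math.exp(-(n * n) / (2.0 * sigma * sigma))
            for n in range(-radius, radius + 1)]


def deriv_gauss_kernel_x1(sigma, radius):
    """Horizontal derivative of Gaussian as an outer product:
    derivative factor along n1 (columns), smoothing factor along n2 (rows)."""
    d = gaussian_derivative_1d(sigma, radius)
    g = gaussian_1d(sigma, radius)
    size = 2 * radius + 1
    return [[g[j] * d[i] for i in range(size)] for j in range(size)]
```

In practice the two 1D sequences would be applied as a cascade of a row filter and a column filter rather than as the full 2D kernel, which reduces the cost per pixel from the square of the kernel width to twice the width.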


3.3.2 Estimation of the Laplacian 


The Laplacian is a scalar, which is the counterpart of the second derivative for 
multi-variable functions. It can be estimated by finite-difference operators or by 
the Laplacian of Gaussian filtering [Hue 86]. Recall that the zero-crossings of a 
Laplacian-filtered image indicate edge locations. 


Estimation by Finite-Difference Operators 


In order to compute a discrete approximation to the Laplacian, we use the forward
difference to approximate the first horizontal partial derivative:

$$ s_{x_1}[n_1,n_2] = s[n_1+1,n_2] - s[n_1,n_2] $$

and then the backward difference of first differences, to approximate the second
derivative as the derivative of the first derivative, to compute the second horizontal
difference $s_{x_1 x_1}[n_1,n_2]$ as follows:

$$ s_{x_1 x_1}[n_1,n_2] = s_{x_1}[n_1,n_2] - s_{x_1}[n_1-1,n_2] = s[n_1+1,n_2] - 2s[n_1,n_2] + s[n_1-1,n_2] \tag{3.39a} $$

The second vertical difference is computed similarly, given by

$$ s_{x_2 x_2}[n_1,n_2] = s[n_1,n_2+1] - 2s[n_1,n_2] + s[n_1,n_2-1] \tag{3.39b} $$

Then, a discrete approximation to the Laplacian can be defined as

$$ \nabla^2 s[n_1,n_2] = s_{x_1 x_1}[n_1,n_2] + s_{x_2 x_2}[n_1,n_2] = s[n_1+1,n_2] + s[n_1-1,n_2] + s[n_1,n_2+1] + s[n_1,n_2-1] - 4s[n_1,n_2] \tag{3.40} $$

Eqn. (3.40) can be considered as an FIR filter with the impulse response shown in
Figure 3.18(a), i.e., a center weight of $-4$ with each of the four horizontal and
vertical neighbors weighted by 1. Other approximations to estimate the Laplacian yield
other FIR filters, shown in Figure 3.18(b) and (c).

Figure 3.18 Different approximations yield different Laplacian operators.
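
The five-point approximation in (3.40) translates directly into code. This sketch assumes `s[n1][n2]` indexing and an interior pixel; the function name is hypothetical.

```python
def laplacian(s, n1, n2):
    """Five-point discrete Laplacian: sum of the second horizontal
    and second vertical differences at pixel (n1, n2)."""
    return (s[n1 + 1][n2] + s[n1 - 1][n2]
            + s[n1][n2 + 1] + s[n1][n2 - 1]
            - 4 * s[n1][n2])
```

The operator returns zero on any linear intensity ramp and a constant on a quadratic surface, which is why zero-crossings, rather than peaks, are used to locate edges in a Laplacian-filtered image.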


Estimation by the Laplacian of the Gaussian Filter 


Similar to the derivation of the derivative of Gaussian filter, we can merge
presmoothing of the image by a Gaussian filter and estimation of the Laplacian into a
single filter, given by

$$ \nabla^2 \left[ s_c(x_1,x_2) \ast\ast\, g_c(x_1,x_2) \right] = \nabla^2 \left[ g_c(x_1,x_2) \right] \ast\ast\, s_c(x_1,x_2) \tag{3.41} $$

since both the convolution and Laplacian are linear shift-invariant operations. Hence,
we can define the Laplacian of Gaussian (LoG) filter as

$$ h_{LoG}(x_1,x_2) = \nabla^2 \left[ g_c(x_1,x_2) \right] = K \, \frac{x_1^2 + x_2^2 - 2\sigma^2}{\sigma^4} \, e^{-\frac{x_1^2 + x_2^2}{2\sigma^2}} \tag{3.42} $$

Digital implementation of the LoG filter requires sampling $h_{LoG}(x_1,x_2)$ on an
appropriate support for a particular value of $\sigma$. We note that the LoG filter is not
separable, but it can be approximated by a difference of Gaussians (DoG) filter

$$ \nabla^2 \left[ g_c(x_1,x_2) \right] \approx K_1 e^{-\frac{x_1^2 + x_2^2}{2\sigma_1^2}} - K_2 e^{-\frac{x_1^2 + x_2^2}{2\sigma_2^2}} \tag{3.43} $$

with proper choice of $\sigma_1$ and $\sigma_2$ for efficient implementation.
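
Sampling the LoG impulse response on a finite support can be sketched as follows. The sketch assumes $K = 1$, a square support of hypothetical half-width `radius`, and it subtracts the mean of the sampled coefficients so that the truncated kernel sums to zero and gives no response on flat regions; that last step is a common practical adjustment and is an assumption here, not part of (3.42).

```python
import math


def log_kernel(sigma, radius):
    """Sampled Laplacian of Gaussian with K = 1, stored as k[n2 + radius][n1 + radius]."""
    size = 2 * radius + 1
    k = []
    for n2 in range(-radius, radius + 1):
        row = []
        for n1 in range(-radius, radius + 1):
            r2 = n1 * n1 + n2 * n2
            row.append((r2 - 2.0 * sigma ** 2) / sigma ** 4
                       * math.exp(-r2 / (2.0 * sigma ** 2)))
        k.append(row)
    # Assumed adjustment: remove the small DC offset caused by truncation.
    mean = sum(sum(row) for row in k) / (size * size)
    return [[v - mean for v in row] for row in k]
```

The kernel is negative near the center and positive beyond the radius $\sqrt{2}\,\sigma$, which produces the sign change across an edge that zero-crossing detection relies on.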


3.3.3 Canny Edge Detection 


We treat edge detection for a monochrome image $s(n_1,n_2)$ only, since edge detection
for color images is often performed on the luminance (Y) channel. Edge pixels are
defined as those pixels where the magnitude of the gradient has a maximum and/or
the Laplacian is zero (see Figure 3.16). Determination of edge pixels is often
implemented by simple thresholding as follows:

$$ \left| \nabla s(n_1,n_2) \right| > T \tag{3.44a} $$

and/or

$$ \left| \nabla^2 s(n_1,n_2) \right| < \varepsilon \tag{3.44b} $$

where $T$ and $\varepsilon$ are some threshold values [Dav 75, Sha 01]. The problem with
simple thresholding as in Eqn. (3.44) is in how to determine good threshold values ($T$ and
$\varepsilon$) and an appropriate scale parameter $\sigma$ to be used in the derivative of
Gaussian and LoG filters. If $\sigma$ is large, the image will be smoothed (blurred) too
much and some edges may be lost. If $T$ is chosen too small or $\varepsilon$ is too big (in
their own scales), then detected edges may be too thick (poor localization); otherwise,
some edges may not be detected (poor detection).

Canny’s edge detection, perhaps the most widely used method, addresses a
compromise between good detection and good localization. Canny [Can 86] shows that
the derivative of Gaussian filter provides a close approximation to an optimal
filter that maximizes the product of a localization and a detection measure. Canny’s
method also includes a non-local maxima suppression step for thinning, and a
hysteresis thresholding step for computing meaningful connected edges.

The complete Canny edge-detection procedure is as follows: 


1. Computation of Image Gradient: Evaluate the magnitude and direction of the
image gradient, given by (3.34a) and (3.34b), at each pixel, where $s_{x_1}(n_1,n_2)$
and $s_{x_2}(n_1,n_2)$ are computed by the derivative of Gaussian filters (3.38a) and
(3.38b), respectively.

2. Non-Local Maximum Suppression: This is an edge-thinning procedure using
the direction and magnitude of the gradient. If the magnitude of the gradient
at the center pixel $(n_1,n_2)$ of an 8-neighborhood is less than the magnitude
of the gradient in at least one of its two neighbors in the direction of the
gradient, the magnitude of the gradient at the pixel $(n_1,n_2)$ is set equal to
0. The computed gradient direction is rounded off to one of 0, 45, 90, or
135 degrees to determine the two 8-neighbors of the center pixel $(n_1,n_2)$ for
comparison.

3. Hysteresis Thresholding: This is an edge-linking procedure. Two thresholds, a
high and a low threshold, are defined for edge detection and edge following,
respectively, where the high threshold is two or three times the low threshold.
This is a two-step procedure: i) All pixels where the magnitude of the gradient
is above the high threshold are labeled as edge pixels, and ii) all pixels where the
magnitude of the gradient exceeds the lower threshold are kept as edge pixels if
they are connected to an already labeled edge pixel.


A set of edge maps over a range of scales can be produced by varying $\sigma$. A
fine-to-coarse synthesis can fuse edges at different scales into a single edge map.
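
Steps 2 and 3 of the procedure can be sketched as below, taking the gradient magnitude and direction maps as given. The sketch assumes `mag[row][col]` indexing with the row index increasing along $x_2$, 8-connectivity for edge following, and untouched (zero) border pixels in the thinning step; the function names and the mapping from quantized angles to neighbor offsets are assumptions tied to that axis convention.

```python
from collections import deque

# Neighbor offsets (d_row, d_col) along the quantized gradient direction.
OFFSETS = {0: (0, 1), 45: (1, 1), 90: (1, 0), 135: (1, -1)}


def quantize_direction(angle_deg):
    """Round an angle to the nearest of 0, 45, 90, 135 degrees (mod 180)."""
    a = angle_deg % 180.0
    return min((0, 45, 90, 135, 180), key=lambda q: abs(a - q)) % 180


def non_max_suppression(mag, angle_deg):
    """Keep a pixel only if it is not smaller than both neighbors
    along its gradient direction; all other pixels are set to 0."""
    rows, cols = len(mag), len(mag[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dr, dc = OFFSETS[quantize_direction(angle_deg[r][c])]
            if mag[r][c] >= mag[r + dr][c + dc] and mag[r][c] >= mag[r - dr][c - dc]:
                out[r][c] = mag[r][c]
    return out


def hysteresis(mag, low, high):
    """Label pixels above `high`, then grow along 8-connected
    neighbors whose magnitude exceeds `low`."""
    rows, cols = len(mag), len(mag[0])
    edge = [[False] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if mag[r][c] > high:
                edge[r][c] = True
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (0 <= rr < rows and 0 <= cc < cols
                        and not edge[rr][cc] and mag[rr][cc] > low):
                    edge[rr][cc] = True
                    queue.append((rr, cc))
    return edge
```

Weak pixels survive only when a chain of above-low pixels connects them to a strong pixel, which is what suppresses isolated noise responses while keeping faint parts of a real contour.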


3.3.4 Harris Corner Detection 


Consider taking an image patch centered at the pixel $(x_1,x_2)$ and shifting it by
$(d_1,d_2)$. The weighted sum of squared differences (SSD) between the original and
shifted patches is given by

$$ E(d_1,d_2) = \sum_{x_1} \sum_{x_2} w(x_1,x_2) \left[ s(x_1+d_1,x_2+d_2) - s(x_1,x_2) \right]^2 $$

If we approximate $s(x_1+d_1,x_2+d_2)$ by a Taylor expansion

$$ s(x_1+d_1,x_2+d_2) \approx s(x_1,x_2) + s_{x_1}(x_1,x_2)\, d_1 + s_{x_2}(x_1,x_2)\, d_2 $$

where $s_{x_1}(x_1,x_2)$ and $s_{x_2}(x_1,x_2)$ denote partial derivatives of $s(x_1,x_2)$,
we can express the SSD as

$$ E(d_1,d_2) \approx \sum_{x_1} \sum_{x_2} w(x_1,x_2) \left[ s_{x_1}(x_1,x_2)\, d_1 + s_{x_2}(x_1,x_2)\, d_2 \right]^2 $$

which can be rewritten in matrix form as

$$ E(d_1,d_2) \approx \begin{bmatrix} d_1 & d_2 \end{bmatrix} M \begin{bmatrix} d_1 \\ d_2 \end{bmatrix} $$

where

$$ M = \begin{bmatrix} \sum_{x_1} \sum_{x_2} w(x_1,x_2) \left( s_{x_1}(x_1,x_2) \right)^2 & \sum_{x_1} \sum_{x_2} w(x_1,x_2)\, s_{x_1}(x_1,x_2)\, s_{x_2}(x_1,x_2) \\ \sum_{x_1} \sum_{x_2} w(x_1,x_2)\, s_{x_1}(x_1,x_2)\, s_{x_2}(x_1,x_2) & \sum_{x_1} \sum_{x_2} w(x_1,x_2) \left( s_{x_2}(x_1,x_2) \right)^2 \end{bmatrix} $$

is called the Harris matrix. If a circular weighting, such as a Gaussian, is used, then
the response will be isotropic. The Harris matrix is a function of first partials of an
image about a patch centered at $(x_1,x_2)$.

At a corner or an interest point, the function $E(d_1,d_2)$ must exhibit a large
variation for all shifts $(d_1,d_2) \neq (0,0)$. This condition can be expressed as follows:
$M$ should have two “large” eigenvalues at an interest point. We can reach the following
conclusions based on the magnitudes of the eigenvalues $\lambda_1$ and $\lambda_2$:

1. If $\lambda_1 \approx 0$ and $\lambda_2 \approx 0$, then the pixel $(x_1,x_2)$ has no
features of interest.

2. If $\lambda_1 \approx 0$ and $\lambda_2 \gg \lambda_1$, a horizontal edge is found; if
$\lambda_2 \approx 0$ and $\lambda_1 \gg \lambda_2$, a vertical edge is found.

3. If $\lambda_1$ and $\lambda_2$ both have large positive values greater than a threshold,
then a corner is found.


Harris and Stephens [Har 88] observe that exact computation of eigenvalues
is computationally expensive, and instead suggest the following cornerness
function:

$$ C = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2 = \det M - \alpha \left( \operatorname{trace} M \right)^2 $$

where $\alpha$ is a tunable sensitivity parameter. Their algorithm does not actually
compute the eigenvalues of the matrix $M$; instead, they evaluate the determinant and
trace of $M$ to find corners. The Shi-Tomasi [Shi 94] corner detection method (also
known as detection of good features to track) checks $\min\{\lambda_1, \lambda_2\}$ because
the eigenvalue analysis produces more stable corners for motion tracking.

Scale-invariant feature transform (SIFT) keypoints are another set of corner-like
features that are computed by analyzing the output of DoG filters at successive levels
of scale [Low 04]. The SIFT system uses a feature post-processing stage, which is
similar to that of the Harris detector.
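
Returning to the Harris and Shi-Tomasi measures, the sketch below forms the entries of the Harris matrix for one patch from precomputed partials and evaluates both criteria. It assumes the partials and weights are supplied as flat lists over the patch; the function names and the default value of the sensitivity parameter are illustrative assumptions.

```python
import math


def harris_matrix(sx1, sx2, weights):
    """Return the entries (a, b, c) of M = [[a, b], [b, c]] for one patch."""
    a = sum(w * gx * gx for w, gx in zip(weights, sx1))
    b = sum(w * gx * gy for w, gx, gy in zip(weights, sx1, sx2))
    c = sum(w * gy * gy for w, gy in zip(weights, sx2))
    return a, b, c


def cornerness(a, b, c, alpha=0.05):
    """Harris-Stephens measure: det(M) - alpha * trace(M)^2 (alpha assumed)."""
    det = a * c - b * b
    trace = a + c
    return det - alpha * trace * trace


def min_eigenvalue(a, b, c):
    """Smaller eigenvalue of the symmetric 2x2 matrix, as checked by Shi-Tomasi."""
    half_trace = 0.5 * (a + c)
    root = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    return half_trace - root
```

A patch whose gradients all point the same way gives a negative cornerness and a zero minimum eigenvalue (an edge), while a patch with gradients in two independent directions gives a positive cornerness.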


3.4 Image Enhancement 


Image enhancement refers to processing with the goal of improving attributes, such 
as contrast and sharpness, of an image to make it visually more pleasing. It has 
received significant attention for more than four decades since the era of film-based 
photography. This section addresses two specific problems: i) given an image with 
poor contrast, compute a better-looking image with higher contrast and sharper 
details; ii) given a high dynamic range image (acquired as described in Chapter 2), 
apply dynamic range compression to better render details in bright or dark areas 
on a standard display. Image-enhancement methods can be classified as pixel-based 
contrast-enhancement and spatial-filtering methods. 


1. Pixel-based tone mapping/contrast enhancement: Underexposed or overexposed 
pictures have poor contrast. Contrast stretching (also known as histogram nor- 
malization) is a process that scales pixel intensity (brightness) values to better 
utilize the range between 0 and 255. It is sometimes called dynamic-range expan- 
sion; however, this is not technically correct since dynamic range refers to the 
resolution of image-brightness values, and pixel-based contrast-enhancement 
operators cannot generate new intermediate pixel values to increase the bright- 
ness resolution or image detail. Pixel-based contrast-enhancement methods are 
discussed in Section 3.4.1. 

2. Spatial filtering for tone mapping and image sharpening: Linear or nonlinear
spatial filtering can be applied to emphasize medium-range spatial frequencies
to obtain crisper images. These filters can also be combined with pixel-based
operators to implement tone mapping for dynamic-range compression of
high-dynamic-range (HDR) images to display them on a standard 8 bits/color
display system. Spatial filters are discussed in Section 3.4.2.


3.4.1 Pixel-Based Contrast Enhancement 


Pixel-based contrast-enhancement operators map all pixels with a given input
brightness value to the same output brightness value, which corresponds to shifting and
scaling of the independent variable (intensity) of the histogram of an image. We start
by defining the histogram of an image. We then introduce commonly used pixel-based
contrast-enhancement operations.


Image Histogram 

Assume that gray values of $s(n_1,n_2)$ are quantized to $K$ levels, i.e.,
$0 \le s(n_1,n_2) \le K-1$ for all $(n_1,n_2)$. The histogram $H_s(k)$ gives the relative
frequency of occurrence of each gray level $k$ in the image. That is,

$$ H_s(k) = \frac{J}{N_1 N_2} \quad \text{for } k = 0, \ldots, K-1 \tag{3.45} $$

if the gray level $k$ occurs $J$ times in the image and we have an $N_1 \times N_2$ image.
The pixel counts $J$ are normalized by the total number of pixels $N_1 N_2$ in the image, so
that

$$ \sum_{k=0}^{K-1} H_s(k) = 1 $$

This way, the histogram $H_s(k)$ approximates the probability density function
(pdf) of the image. That is, if a pixel location $(n_1,n_2)$ is chosen at random, then
$H_s(k)$ gives the probability that $s(n_1,n_2) = k$. Statistics of the image can be
computed from its histogram. For example, the mean value of an image can be computed as

$$ \mu_s = \sum_{k=0}^{K-1} k \, H_s(k) \tag{3.46} $$

We can also define the cumulative normalized image histogram as

$$ C_s(k) = \sum_{i=0}^{k} H_s(i), \quad k = 0, \ldots, K-1 \tag{3.47} $$

which approximates the cumulative distribution function (CDF) of the image. It
means that for a randomly selected pixel $(n_1,n_2)$, $\Pr\{s(n_1,n_2) \le k\} = C_s(k)$. We
note that the CDF is a non-decreasing function, and $C_s(K-1) = 1$. Furthermore, $H_s(k)$
can be obtained from $C_s(k)$ by

$$ H_s(k) = C_s(k) - C_s(k-1), \quad k = 0, \ldots, K-1 $$

with $C_s(-1) = 0$.
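
The histogram, its cumulative version, and the mean in (3.45) to (3.47) can be computed as in the following sketch. It assumes an image stored as a list of rows of integer gray levels in the range 0 to `levels - 1`; the function names are hypothetical.

```python
def normalized_histogram(image, levels):
    """Relative frequency of each gray level k = 0..levels-1."""
    counts = [0] * levels
    for row in image:
        for v in row:
            counts[v] += 1
    n = sum(counts)
    return [c / n for c in counts]


def cumulative_histogram(hist):
    """Running sum of the normalized histogram (approximate CDF)."""
    cdf, running = [], 0.0
    for h in hist:
        running += h
        cdf.append(running)
    return cdf


def mean_from_histogram(hist):
    """Mean gray level computed from the histogram alone."""
    return sum(k * h for k, h in enumerate(hist))
```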
Linear Contrast Manipulation


Suppose we process an input image $s(n_1,n_2)$ using the generalized linear mapping

$$ g(n_1,n_2) = \alpha \, s(n_1,n_2) + \beta \tag{3.48} $$

If we let $\alpha = 1$, then we have an additive bias. This corresponds to shifting the
histogram to the left or right depending on the sign of $\beta$. If $\alpha \neq 1$, then we
also have a contraction or stretching of the histogram depending on whether $\alpha < 1$ or
$\alpha > 1$. In both cases, proper clipping should be used to ensure pixel values do not
overshoot or undershoot the allowed range.


Example 1: Automatic Gain Control (AGC)


Many digital cameras employ AGC, which stretches the image histogram to fully
utilize the entire dynamic range from 0 to $K-1$. Let the minimum and
maximum gray levels in an image be $A$ and $B$, respectively. The goal of AGC
is to find the mapping parameters to map $A$ and $B$ to 0 and $K-1$, respectively,
such that

$$ \alpha A + \beta = 0 $$
$$ \alpha B + \beta = K - 1 $$

which can be solved for the two unknowns to give

$$ \alpha = \frac{K-1}{B-A}, \qquad \beta = -\frac{A\,(K-1)}{B-A} $$

Hence, the AGC mapping is given by

$$ g(n_1,n_2) = \frac{K-1}{B-A} \left( s(n_1,n_2) - A \right) $$


Example 2: Image Negative


The negative of an image can be computed as

$$ g(n_1,n_2) = -s(n_1,n_2) + (K-1) $$

which corresponds to flipping of the image histogram

$$ H_g(k) = H_s(K-1-k) $$
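
Both examples are instances of the linear mapping (3.48) and can be sketched in a few lines. The sketch assumes integer gray levels, rounds the AGC output to the nearest level, and returns a copy unchanged for a constant image (where $B = A$ would make the gain undefined); these choices and the function names are assumptions.

```python
def agc(image, levels):
    """Stretch gray levels so the minimum maps to 0 and the maximum to levels - 1."""
    a = min(min(row) for row in image)
    b = max(max(row) for row in image)
    if a == b:
        return [row[:] for row in image]   # constant image: gain undefined
    return [[round((levels - 1) * (v - a) / (b - a)) for v in row]
            for row in image]


def negative(image, levels):
    """Image negative: g = -s + (levels - 1)."""
    return [[levels - 1 - v for v in row] for row in image]
```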


Histogram Equalization 
Histogram equalization is a pixel-based nonlinear operation that aims to produce an
output image with a more uniform histogram. An image with a flat histogram has
maximum entropy. This is achievable if $H_s(\cdot)$ and $C_s(\cdot)$ are functions of
continuous variables (intensity values). Digital histogram equalization maps every
occurrence of a quantized gray level $k$ to $C_s(k)$; hence, a histogram bin can be moved
or merged with another bin, but it can never be split. Therefore, it is not possible to
attain an output image with a perfectly flat histogram. However, it is possible to obtain
an output image that has a more stretched-out histogram than the input image. In the
following, we first derive the histogram equalization operation assuming that
$H_s(\cdot)$ and $C_s(\cdot)$ are functions of continuous variables. We then discuss the
digital approximation of it.
Suppose that $H_s(x)$ and $C_s(x)$ are functions of a continuous variable $x$, such that

$$ H_s(x) = \frac{dC_s(x)}{dx} $$

where $C_s(x)$ is non-decreasing. We assume that an inverse $C_s^{-1}(x)$ can be defined for
$C_s(x)$ since $C_s(x)$ is non-decreasing. Then, we claim that the image

$$ g(n_1,n_2) = C_s(s(n_1,n_2)) \tag{3.49} $$

has a uniform (flat) histogram. We can show this as follows:

$$ C_g(x) = \Pr\{g(n_1,n_2) \le x\} = \Pr\{C_s(s(n_1,n_2)) \le x\} = \Pr\{s(n_1,n_2) \le C_s^{-1}(x)\} = C_s(C_s^{-1}(x)) = x, \quad 0 \le x \le 1 $$

Therefore,

$$ H_g(x) = \frac{dC_g(x)}{dx} = 1, \quad 0 \le x \le 1 $$

Based on this analysis, a procedure for digital histogram equalization can be
summarized as:

1. Compute the histogram $H_s(k)$ of the image, $k = 0, \ldots, K-1$.
2. Compute the cumulative distribution function

$$ C_s(k) = \sum_{i=0}^{k} H_s(i), \quad k = 0, \ldots, K-1 $$

3. Compute $g[n_1,n_2] = (K-1)\, C_s(s[n_1,n_2])$ for all pixels $(n_1,n_2)$.
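
The three steps above can be sketched as a single lookup-table operation. The sketch assumes integer gray levels and rounds the scaled CDF value to the nearest level (the rounding rule is an assumption); the function name is hypothetical.

```python
def equalize(image, levels):
    """Digital histogram equalization via a per-level lookup table."""
    # Step 1: histogram (as raw counts).
    counts = [0] * levels
    for row in image:
        for v in row:
            counts[v] += 1
    n = sum(counts)
    # Steps 2 and 3: cumulative counts scaled to the output range.
    mapping, running = [], 0
    for c in counts:
        running += c
        mapping.append(round((levels - 1) * running / n))
    return [[mapping[v] for v in row] for row in image]
```

Because every pixel with the same input level goes through the same table entry, the output can never contain more distinct gray levels than the input.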


Histogram Shaping 

Histogram shaping is a generalization of histogram equalization, where the output
image should have a pre-specified desired histogram as opposed to having a flat
histogram. Again, the exact desired histogram can only be obtained in the hypothetical
case of continuous-intensity gray-scale images. Suppose the desired CDF is $Q(x)$ and
$Q^{-1}(x)$ can be defined. Then,

$$ g(n_1,n_2) = Q^{-1}(C_s(s(n_1,n_2))) \tag{3.50} $$

has the desired CDF $Q(x)$ if every pixel has continuous (not quantized) intensity
values. This can be shown, similar to the case of histogram equalization, as

$$ C_g(x) = \Pr\{g(n_1,n_2) \le x\} = \Pr\{Q^{-1}(C_s(s(n_1,n_2))) \le x\} = \Pr\{C_s(s(n_1,n_2)) \le Q(x)\} $$
$$ = \Pr\{s(n_1,n_2) \le C_s^{-1}(Q(x))\} = C_s(C_s^{-1}(Q(x))) = Q(x), \quad 0 \le x \le 1 $$

In the practical case of histogram shaping for images with quantized gray values,
the desired CDF $Q(k)$ is discrete, and $Q^{-1}(k)$ must be defined carefully. In most
cases, $Q(k)$ is either empirically stated or computed from another image (histogram
matching problem). Then, $Q^{-1}(k)$ can be defined as

$$ Q^{-1}(k) = \min_{l} \{ l : Q(l) \ge k \} \tag{3.51} $$

which completes the specification of the histogram-shaping method.
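
A minimal sketch of histogram shaping for quantized images follows. It assumes the desired CDF is supplied as a list `desired_cdf` of length `levels` ending at 1, interprets the inverse in (3.51) as the smallest level whose desired CDF value reaches the source CDF value, and uses a small tolerance for floating-point comparison; the names and the tolerance are assumptions.

```python
def match_histogram(image, levels, desired_cdf):
    """Map each gray level v to Q^{-1}(C_s(v)) for a discrete desired CDF Q."""
    counts = [0] * levels
    for row in image:
        for v in row:
            counts[v] += 1
    n = sum(counts)

    def q_inverse(p):
        # Smallest level l whose desired CDF value reaches p.
        for l, q in enumerate(desired_cdf):
            if q >= p - 1e-12:
                return l
        return levels - 1

    mapping, running = [], 0
    for c in counts:
        running += c
        mapping.append(q_inverse(running / n))
    return [[mapping[v] for v in row] for row in image]
```

Passing a uniform desired CDF reduces this to a form of histogram equalization; passing the CDF of a second image gives histogram matching.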


Local Contrast Manipulation by Pixel-Based Operators 


Contrast manipulation using global operators based on the histogram of the
entire image does not produce satisfactory results if the input image has a bimodal
histogram. An example image with a bimodal histogram containing both very bright
and very dark areas is shown in Figure 3.19(a). In such cases, we can achieve adaptive
contrast enhancement by applying pixel-based contrast operators on local sliding
image windows that are possibly overlapping, such that each window has a unimodal
histogram containing either dark or bright areas.

Figure 3.19 Adaptive histogram equalization: (a) input image containing both light and dark
areas and (b) output image with 50 x 50 partially overlapping windows.


3.4.2 Spatial Filtering for Tone Mapping and 
Image Sharpening 


We begin this section with a tone-mapping method that is specifically developed for 
dynamic range compression of HDR images. We next discuss the retinex filter and 
unsharp masking and its extension based on bi-lateral filtering, which have been 
proposed as image-enhancement methods, but can also be used for dynamic range 
compression of HDR images. 


Digital Dodging-and-Burning for Tone Mapping [Rei 02] 


Reinhard [Rei 02] proposed a tone-reproduction operator for dynamic range
compression of HDR images that is inspired by the photographic dodging-and-burning
process. Dodging-and-burning is an artistic photographic printing technique where
some light is manually withheld (dodging) from a region during development, or
more light is added (burning) to a region. This will lighten or darken selected regions
in the final print relative to what it would be if the same amount of light were used
for all regions of the print. The digital processing has two steps:


1. Luminance scaling for a given scene key value: Compute the geometric mean of
the scene luminance:

$$ \bar{L}_w = \exp \left( \frac{1}{N} \sum_{x_1,x_2} \log \left( \delta + L_w(x_1,x_2) \right) \right) \tag{3.52a} $$

where $L_w(x_1,x_2)$ is the scene luminance at pixel $(x_1,x_2)$, $N$ is the number of pixels
in the image, and $\delta$ is a small constant to avoid singularity if black pixels are
present in the image. Then, for a given key $a$, the luminance values are scaled as

$$ L(x_1,x_2) = \frac{a}{\bar{L}_w} \, L_w(x_1,x_2) \tag{3.52b} $$

The key of a scene indicates whether it is subjectively light, normal, or dark.
A predominantly white scene would be high-key, and a dim scene would be
low-key. The value of $a$ is typically between 0.09 (low-key) and 0.36 or higher
(high-key). For normal key scenes, $a = 0.18$.

2. Dynamic range compression: If the dynamic range of the image exceeds that of
the display, as would be the case in HDR images, not all luminance values can
be displayed, and high values should be saturated by a compressive function.
A simple compressive function is of the form

$$ L_d(x_1,x_2) = \frac{L(x_1,x_2)}{1 + L(x_1,x_2)} \tag{3.52c} $$

where $L_d(x_1,x_2)$ denotes display values. However, this pixel-based scaling may
result in loss of some important image details. A photographer would resort to
dodging-and-burning to vary exposure locally to overcome this problem. A digital
operator that resembles the local dodging-and-burning process may be given by

$$ L_d(x_1,x_2) = \frac{L(x_1,x_2)}{1 + V_1(x_1,x_2,s_m(x_1,x_2))} \tag{3.52d} $$

where $V_1(x_1,x_2,s_m(x_1,x_2))$ is a local average for pixel $(x_1,x_2)$ with a properly
chosen neighborhood [Rei 02]. It is possible to compute the local average by
means of a bi-lateral filter to avoid a halo effect around sharp edges.
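
The global part of this operator, i.e., luminance scaling followed by the simple compressive function (3.52c), can be sketched as follows. The local operator (3.52d) is not implemented here. The default key of 0.18, the value of the small constant `delta`, and the function name are assumptions of this sketch.

```python
import math


def tone_map(lum, key=0.18, delta=1e-6):
    """Global tone mapping: scale by key / geometric mean, then compress with L / (1 + L)."""
    values = [v for row in lum for v in row]
    # Geometric mean of the scene luminance via the mean of logarithms.
    log_mean = sum(math.log(delta + v) for v in values) / len(values)
    geo_mean = math.exp(log_mean)
    scaled = [[key * v / geo_mean for v in row] for row in lum]
    # Compressive function maps [0, inf) into [0, 1).
    return [[v / (1.0 + v) for v in row] for row in scaled]
```

The compressive function is nearly linear for small scaled luminances and saturates smoothly toward 1 for large ones, so the output always fits the display range regardless of the input dynamic range.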
Unsharp Masking


Unsharp masking (USM) is a spatial-filtering operation that enhances medium to
high frequencies, including edges. The USM filter is given by

$$ g(n_1,n_2) = s_L(n_1,n_2) + \beta \left[ s(n_1,n_2) - s_L(n_1,n_2) \right] \tag{3.53} $$

where $s_L(n_1,n_2)$ is the low-pass filtered input image $s(n_1,n_2)$ and $\beta$ is the
filter gain. In the frequency domain, we have

$$ G(e^{j\omega_1}, e^{j\omega_2}) = S_L(e^{j\omega_1}, e^{j\omega_2}) + \beta \left[ S(e^{j\omega_1}, e^{j\omega_2}) - S_L(e^{j\omega_1}, e^{j\omega_2}) \right] \tag{3.54} $$

The operation of the filter in the frequency domain is demonstrated by an example
of a 1D signal in Figure 3.20.

The difference, $s(n_1,n_2) - s_L(n_1,n_2)$, is a detail image with medium-to-high
frequency content since low frequencies are subtracted, as shown in Figure 3.20. Hence,
the basic idea of USM is to decompose an input image into a low-pass and a detail
image. The detail image is multiplied by a gain factor $\beta$ and then added back to the
low-pass filtered image. The result is an image with medium to high frequencies
(texture and edges) boosted or enhanced.

The frequency range of the difference image determines what frequencies shall
be enhanced in the output image, which is controlled by varying the bandwidth
parameter of the low-pass filter. The low-pass filtered image can be computed in the
spatial domain or in the DFT domain using a separable 2D Gaussian or uniform
(box filter) impulse response over a rectangular support. Fast implementation of box
filtering [McD 81] was discussed in Section 3.1.1.

Figure 3.20 Illustration of the concept of unsharp masking: (a) original signal spectrum;
(b) low-pass filtered signal spectrum; (c) difference spectrum scaled by $\beta$; and
(d) spectrum of the enhanced signal obtained by adding (b) and (c).
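
The USM operation in (3.53) is illustrated below on a 1D signal for brevity. The sketch assumes a uniform (box) low-pass filter of half-width `radius` with replicated border samples; the function names and the border handling are assumptions, and a 2D version would apply the same box filter separably along rows and columns.

```python
def box_lowpass(x, radius=1):
    """Moving average with replicated borders."""
    n = len(x)
    out = []
    for i in range(n):
        window = [x[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def unsharp_mask(x, beta, radius=1):
    """Low-pass signal plus beta times the detail (input minus low-pass)."""
    low = box_lowpass(x, radius)
    return [l + beta * (v - l) for v, l in zip(x, low)]
```

With a gain of 1 the input is returned unchanged; with a gain above 1 a step edge acquires an undershoot on its dark side and an overshoot on its bright side, which is what makes the edge appear sharper.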


Adaptive Filtering 


The basic idea of unsharp masking has been extended to adaptive contrast manipula- 
tion as well as matching the dynamic range of an image to that of the display [Pel 82, 
Dur 02]. Peli and Lim [Pel 82] merged nonlinear pixel scaling and spatial filtering in 
an adaptive-filtering framework: 


$$g(n_1,n_2) = T[s_L(n_1,n_2)] + B[s_L(n_1,n_2)]\,[s(n_1,n_2) - s_L(n_1,n_2)] \tag{3.55}$$


where $T[\cdot]$ is a nonlinear point operation to match the dynamic range of the image
to that of the display medium, and the filter gain $B$ changes according to the value
of the smoothed image. The block diagram of the adaptive-filtering framework is
depicted in Figure 3.21(a). Examples of $T[\cdot]$ and $B(s_L)$ are shown in Figure 3.21(b).

Durand and Dorsey [Dur 02] proposed replacing the simple Gaussian low-pass
filter with a bi-lateral filter and used the resulting nonlinear filter for dynamic-range
compression of HDR images. They compress the dynamic range of only the smoothed
luminance image (modeling scene illumination) by a choice of $T[s_L(n_1,n_2)]$ and
preserve the detail image as well as the Cr and Cb components.

Figure 3.21 Adaptive image enhancement: (a) block diagram of the adaptive filter and
(b) examples of nonlinear functions $T[s_L(n_1,n_2)]$ and $B[s_L(n_1,n_2)]$.


Retinex 


Low-contrast images are often a result of unfavorable illumination. We can obtain
a better image by removing the effect of undesired illumination. Observed image
luminance $L(x_1,x_2)$ can be modeled by the product of incident illumination $I(x_1,x_2)$
and scene reflectance $R(x_1,x_2)$, as


$$L(x_1,x_2) = I(x_1,x_2)\,R(x_1,x_2) \tag{3.56a}$$


Taking the logarithm of both sides, we transform the image to the log-luminance
domain:


$$l(x_1,x_2) = \log L(x_1,x_2) = \log I(x_1,x_2) + \log R(x_1,x_2) = i(x_1,x_2) + r(x_1,x_2) \tag{3.56b}$$


where the effect of undesired illumination is additive. Retinex refers to a model and 
algorithm that was proposed by Land and McCann [McC 04] for removal of an 
undesired illumination component, which is modeled by an additive ramp (gradi- 
ent) in the log-luminance domain. 

There are various retinex algorithms that implement an iterative sequence of 
pixel comparisons at various scales or distances. Each iteration consists of so-called 
ratio, product, reset, and average operations. The overall effect of these iterative com- 
putations is low-pass filtering in the log-luminance domain. The McCann99 retinex 
implemented over a multi-resolution pyramid where the top-level is not larger than 
$5 \times 5$ pixels (MATLAB implementation available) [Fun 04] and the multi-scale
retinex with color restoration [Rah 04] are the most popular implementations.


Homomorphic Filtering 


Related to retinex, filtering in the log-intensity domain (since the human visual system
perceives the logarithm of intensity) is known as homomorphic filtering [Opp 68].
The block diagram of homomorphic filtering is depicted in Figure 3.22, where a
linear filter is used in the log-intensity domain. A high-pass filter may be employed
to remove the low-frequency illumination component, similar to retinex, and obtain
an image with more even illumination. Alternatively, a low-pass filter may be used
to eliminate multiplicative intensity-domain noise. The main difference between
retinex and homomorphic filtering is that in homomorphic filtering there is an
exponentiation after filtering that takes the output image back to the intensity domain for
display.

Figure 3.22 Block diagram for homomorphic filtering.
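
The log, linear filter, exponentiation chain can be sketched in 1D as follows. This is only an illustration: the high-pass filter is assumed to be "signal minus moving average", the function names are hypothetical, and all intensities are assumed to be strictly positive so that the logarithm exists.

```python
import math


def moving_average(x, radius):
    """Low-pass filter with replicated borders."""
    n = len(x)
    return [
        sum(x[min(max(k + d, 0), n - 1)] for d in range(-radius, radius + 1))
        / (2 * radius + 1)
        for k in range(n)
    ]


def homomorphic_highpass(intensity, radius=2):
    """Log -> high-pass (signal minus local average) -> exponentiation."""
    log_s = [math.log(v) for v in intensity]
    low = moving_average(log_s, radius)
    detail = [a - b for a, b in zip(log_s, low)]
    return [math.exp(d) for d in detail]


reflectance = [1.0, 1.0, 4.0, 4.0, 1.0, 1.0, 4.0, 4.0]
dim = homomorphic_highpass([0.5 * r for r in reflectance])
bright = homomorphic_highpass([8.0 * r for r in reflectance])
```

Because a constant illumination factor becomes an additive constant in the log domain, the high-pass filter removes it, and `dim` and `bright` agree up to rounding error.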


3.5 Image Denoising 


Image-capture mechanisms are not perfect. Images suffer from graininess due to 
photon noise, electronic noise, quantization noise, and impulsive noise due to sen- 
sor cell defects. Speckle noise is common in radar-image sequences and biomedical 
cine-ultrasound sequences. As a result, all images are contaminated by noise to some 
extent, which may or may not be visible. Even if noise may not be perceived at full- 
speed video due to the temporal-masking effect of the eye, it often leads to poor- 
quality “freeze-frame” still images. The signal-to-noise ratio (SNR) is an important 
imaging parameter, and it varies with the imaging modality and device. 

Besides resulting in visually displeasing images and masking image details, noise 
also poses serious problems with solving ill-posed image processing problems. In 
image restoration and super-resolution, noise is the fundamental limitation in recov- 
ering high-frequency information. In motion estimation, it is important to distin- 
guish intensity variations due to motion from those due to noise. In image/video 
compression, noise increases the entropy, hindering effective compression. 

This section presents spatial denoising filters, where a single image is processed. 
Multi-frame video denoising is studied in Chapter 6. We can classify spatial noise 
filters in several ways: i) linear vs. nonlinear, ii) shift-invariant vs. adaptive, iii) local 
vs. non-local, and iv) pixel-wise vs. block-wise. Most filters fall into more than one 
of these classes, e.g., local, pixel-wise, adaptive or non-local, block-wise, nonlinear. 
After discussing image and noise modeling in Section 3.5.1, Section 3.5.2 introduces 
linear space-invariant filters that may be implemented in the spatial or transform 
domain. Local adaptive filters, such as the local LMMSE filters and directional fil- 
ters, are covered in Section 3.5.3. We study nonlinear filters, such as order statistics 
filters, wavelet shrinkage, and bi-lateral filters in Section 3.5.4. Non-local filters, 
such as the non-local means (NLM) and BM3D, are introduced in Section 3.5.5. 
A unifying data-adaptive (steerable) kernel regression framework that covers most of
these filters and iterative filters can be found in [Mil 13]. 


3.5.1 Image and Noise Models 


It is obvious that exact separation of fluctuations in image intensity due to noise 
from genuine image detail is impossible. All denoising methods are based on model- 
ing self-similarity and sparsity characteristics of images and statistical characteristics 
of noise using models with varying complexity. Our ability to separate the signal 
from noise depends on how well various models allow such separation. 


Noise Models 


The noise can be modeled as additive or multiplicative, white or colored, and signal- 
dependent or signal-independent. A simple additive noise model is given by 


$$y(n_1,n_2) = s(n_1,n_2) + v(n_1,n_2) \tag{3.57}$$


where $s(n_1,n_2)$ and $v(n_1,n_2)$ denote the ideal image and noise, respectively.
A noise source is called white noise, if all noise samples are uncorrelated with
each other, i.e.,


$$E\{v(n_1,n_2)\,v(i_1,i_2)\} = \sigma_v^2\,\delta(n_1-i_1, n_2-i_2) = \begin{cases} \sigma_v^2 & \text{if } (n_1,n_2) = (i_1,i_2) \\ 0 & \text{otherwise} \end{cases} \tag{3.58}$$


where $E\{\cdot\}$ denotes the expectation operator and $\sigma_v^2$ is the variance of the noise.
Independent and identically distributed (i.i.d.) is a stronger condition than white.
A noise source is signal-independent if


$$E\{s(n_1,n_2)\,v(i_1,i_2)\} = 0 \quad \text{for all } (n_1,n_2) \text{ and } (i_1,i_2) \tag{3.59}$$


For example, photon noise and film grain are signal-dependent, whereas charge- 
coupled device (CCD) sensor noise and quantization noise are usually modeled as 
white, Gaussian, and signal-independent. Ghosts in TV images can also be mod- 
eled as signal-dependent noise. Other noise models can be found in Chapter 4.5 
of [Bov 00]. 
The level of noise in an image is specified in terms of the signal-to-noise ratio
(SNR), defined as


$$\mathrm{SNR} = 10 \log_{10} \frac{\sigma_s^2}{\sigma_v^2} \tag{3.60}$$


in decibels (dB), where $\sigma_s^2$ is the variance of the original noise-free image. If the SNR
is below some level, typically 30 dB, i.e., the noise variance is more than 1/1000 of
that of the image, the noise becomes visible as a pattern of graininess, resulting in a
degradation of the image quality.


Estimation of the SNR 


In general, we need to estimate the variance $\sigma_v^2$ of the noise and variance $\sigma_s^2$ of the
noise-free image from the given noisy image. This requires an ergodicity assumption,
which allows us to estimate ensemble image statistics from a given sample image.

The variance of the noise can be estimated from a flat (untextured) image region,
where the variance should ideally be zero in the absence of any noise. To this effect,
we manually mark a flat rectangular region $W$ and first estimate its mean


$$\hat{\mu}_W = \frac{1}{M} \sum_{(i_1,i_2) \in W} y(i_1,i_2) \tag{3.61}$$


where $M$ is the number of pixels in the selected region. The variance of the noise is
estimated as


$$\hat{\sigma}_v^2 = \frac{1}{M} \sum_{(i_1,i_2) \in W} [y(i_1,i_2) - \hat{\mu}_W]^2 \tag{3.62}$$


In order to estimate the variance of the noise-free image, we repeat this procedure,
this time over the whole picture to obtain $\hat{\sigma}_y^2$. Then, the estimate of $\sigma_s^2$ is given by


$$\hat{\sigma}_s^2 = \max\{\hat{\sigma}_y^2 - \hat{\sigma}_v^2, 0\} \tag{3.63}$$

so that $\hat{\sigma}_s^2$ is always non-negative.
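
The estimation procedure above can be sketched as follows. The image is assumed to be a list of rows, the flat region is assumed to be given as half-open row and column bounds, and the function names are hypothetical; the tiny test image is synthetic and chosen so that the numbers are easy to check by hand.

```python
import math


def variance(values):
    """Sample mean and variance as in Eqs. (3.61) and (3.62)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def estimate_snr_db(image, flat_region):
    """flat_region = (row0, row1, col0, col1), half-open, marked by the user."""
    r0, r1, c0, c1 = flat_region
    flat = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    var_v = variance(flat)                               # noise variance
    var_y = variance([p for row in image for p in row])  # whole noisy image
    var_s = max(var_y - var_v, 0.0)                      # Eq. (3.63)
    if var_v == 0.0:
        snr = math.inf
    elif var_s == 0.0:
        snr = -math.inf
    else:
        snr = 10.0 * math.log10(var_s / var_v)           # Eq. (3.60)
    return var_v, var_s, snr


image = [
    [9, 11, 49, 51],
    [11, 9, 51, 49],
    [9, 11, 49, 51],
    [11, 9, 51, 49],
]
var_v, var_s, snr_db = estimate_snr_db(image, (0, 4, 0, 2))
```

Here the left half of the image is treated as the flat region; it fluctuates by one gray level around 10, so the noise variance estimate is 1, the image variance estimate is 400, and the SNR is about 26 dB.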


Image Models and Performance Limits 


Various image models that have been used in image denoising are summarized in
Appendix B. Minimizing a cost function based on the $\ell^2$-norm of the error (expressed
as the minimum mean square error) subject to a global smoothness constraint (using
a homogeneous random field model) results in the well-known Wiener filter given
by Eqn. (3.68). Minimizing the $\ell^1$-norm (expressed as the sum absolute error) yields
the median filter. The wavelet shrinkage has been shown to minimize the $\ell^0$-norm
of the estimation error, which is a sparseness measure. Non-local patch-similarity
based models have also been used (see Section 3.5.5) for image denoising. Clearly,
performance limits of denoising filters are strongly related to how well the image and
noise models match the real situation. A lower bound for achievable mean-square
error derived in [Cha 10] suggests that there may be room for further improvement
to reach the performance limits in image denoising.


3.5.2 Linear Space-Invariant Filters in the DFT Domain 


Linear space-invariant (LSI) denoising filters can be designed and analyzed using 
frequency-domain concepts, and are easier to implement. Natural images have more 
energy at low frequencies than at high frequencies, and the spectrum of white noise 
is flat as illustrated in Figure 3.23. An LSI noise reduction filter typically attenuates 
frequencies where the noise power exceeds the signal power. However, in LSI denois- 
ing, there is an inherent tradeoff between noise reduction and blurring of image 
detail. This is because any LSI denoising filter is essentially a low-pass filter, also 
called a smoothing filter (see Section 3.1.1), which suppresses high frequencies. As a 
result, high-frequency image content is also attenuated, causing blurring. 

The linear minimum mean-square error (LMMSE) filter is the optimal LSI filter
that yields the minimum mean-square error estimate of the ideal image; i.e., it is
the optimal filter in the minimum mean-square error sense among all linear filters.
Loosely speaking, it determines the best cutoff frequency for low-pass filtering given
the power spectra of the ideal image and the noise. In the following, we derive both
the infinite-impulse response (IIR) and finite-impulse response (FIR) LMMSE filters,
using the principle of orthogonality, assuming that image and noise are wide-sense
stationary; i.e., their means are constant and their correlation functions are
shift-invariant. The image and noise are assumed to be zero mean without loss of
generality, since any non-zero mean can be removed prior to filtering.

Figure 3.23 Tradeoff between noise reduction and blurring in LSI denoising filters.


IIR Wiener Filter 
The input-output relationship for the IIR LMMSE filter can be expressed in the
form of a convolution, given by


$$\hat{s}(n_1,n_2) = \sum_{i_1=-\infty}^{\infty} \sum_{i_2=-\infty}^{\infty} h(i_1,i_2)\, y(n_1-i_1, n_2-i_2) \tag{3.64}$$


where $y(n_1,n_2)$ is the noisy image, $\hat{s}(n_1,n_2)$ denotes the LMMSE estimate of the ideal
image $s(n_1,n_2)$, and $h(n_1,n_2)$ is the impulse response of the filter.

The principle of orthogonality states that estimation error, $s(n_1,n_2) - \hat{s}(n_1,n_2)$, at
each pixel should be orthogonal to every sample of the observed image, which can
be expressed as


$$\langle s(n_1,n_2) - \hat{s}(n_1,n_2),\, y(k_1,k_2) \rangle = E\{(s(n_1,n_2) - \hat{s}(n_1,n_2))\, y(k_1,k_2)\} = 0 \tag{3.65}$$


for all $(n_1,n_2)$ and $(k_1,k_2)$, where the inner product $\langle \cdot,\cdot \rangle$ is defined in terms of the
expectation operator $E\{\cdot\}$. Thus, orthogonality means uncorrelatedness.
Substituting (3.64) into (3.65), and simplifying the resulting expression, we
obtain


$$\sum_{i_1=-\infty}^{\infty} \sum_{i_2=-\infty}^{\infty} h(i_1,i_2)\, R_{yy}(n_1-i_1-k_1, n_2-i_2-k_2) = R_{sy}(n_1-k_1, n_2-k_2) \quad \text{for all } (n_1,n_2) \text{ and } (k_1,k_2) \tag{3.66a}$$


where


$$R_{yy}(n_1-i_1-k_1, n_2-i_2-k_2) = E\{y(n_1-i_1, n_2-i_2)\, y(k_1,k_2)\}$$


denotes the autocorrelation function of the observations, and


$$R_{sy}(n_1-k_1, n_2-k_2) = E\{s(n_1,n_2)\, y(k_1,k_2)\}$$


is the cross-correlation between the ideal image and the observed image. The double
summation can be expressed as a 2D convolution:


$$h(n_1,n_2) ** R_{yy}(n_1,n_2) = R_{sy}(n_1,n_2) \tag{3.66b}$$


which is called the discrete Wiener-Hopf equation. The expression (3.66) defines
the impulse response of the noncausal, IIR Wiener filter, also known as the unrealiz- 
able Wiener filter. This filter is unrealizable because an infinite-time delay is required 
to compute an output sample. 

We can obtain the frequency response of the unrealizable Wiener filter by taking
the 2D Fourier transform of both sides of (3.66b), which results in


$$H(f_1,f_2) = \frac{P_{sy}(f_1,f_2)}{P_{yy}(f_1,f_2)} \tag{3.66c}$$


where $P_{yy}(f_1,f_2)$ and $P_{sy}(f_1,f_2)$ are the auto- and cross-power spectra, respectively. The
derivation, up to this point, is quite general, in the sense that it only assumes that the
filter (3.64) is linear, uses the principle of orthogonality (3.65), and is independent
of the image and noise models. However, derivation of the expressions $P_{yy}(f_1,f_2)$ and
$P_{sy}(f_1,f_2)$ requires a problem-specific observation model.
For the denoising problem, the auto- and cross-power spectra in (3.66) can be
derived from the observation model (3.57) and the noise model (3.58) and (3.59) as

$$\begin{aligned} R_{sy}(n_1,n_2) &= E\{s(i_1,i_2)\, y(i_1-n_1, i_2-n_2)\} \\ &= E\{s(i_1,i_2)\, s(i_1-n_1, i_2-n_2)\} + E\{s(i_1,i_2)\, v(i_1-n_1, i_2-n_2)\} \\ &= R_{ss}(n_1,n_2) \end{aligned} \tag{3.67a}$$


and


$$\begin{aligned} R_{yy}(n_1,n_2) &= E\{(s(i_1,i_2) + v(i_1,i_2))\,(s(i_1-n_1, i_2-n_2) + v(i_1-n_1, i_2-n_2))\} \\ &= R_{ss}(n_1,n_2) + R_{vv}(n_1,n_2) \end{aligned} \tag{3.67b}$$


where we assume that the image and noise are uncorrelated. Hence, the power
spectra $P_{sy}(f_1,f_2) = P_{ss}(f_1,f_2)$ and $P_{yy}(f_1,f_2) = P_{ss}(f_1,f_2) + P_{vv}(f_1,f_2)$ can be obtained by
taking the 2D Fourier transform of the respective correlation functions. Then, the
frequency response of the Wiener filter becomes


$$H(f_1,f_2) = \frac{P_{ss}(f_1,f_2)}{P_{ss}(f_1,f_2) + P_{vv}(f_1,f_2)} \tag{3.68}$$


We observe that $P_{vv}(f_1,f_2) = \sigma_v^2$ since the noise is white and the power spectrum
of the noise-free image can be approximated with that of the noisy image as
$P_{ss}(f_1,f_2) \approx |Y(f_1,f_2)|^2$. It is well-known that the estimate can be improved by sectioning the
image and averaging the spectrum estimates over the sections.

The Wiener filter is a low-pass filter, since the image power diminishes at high 
frequencies, which implies that the filter frequency response goes to zero at high fre- 
quencies, whereas at low frequencies the noise power is negligible compared to the 
image power so that the frequency response of the filter approaches one. 
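
The low-pass behavior of the Wiener filter can be checked numerically with a small sketch. The image power spectrum used here is a hypothetical model that simply decays with frequency; it is an assumption for illustration and is not taken from the text. The function name `wiener_gain` is likewise illustrative.

```python
def wiener_gain(p_ss, noise_var):
    """H = P_ss / (P_ss + sigma_v^2): Eq. (3.68) with white noise."""
    return p_ss / (p_ss + noise_var)


# Hypothetical image power spectrum that decays with frequency (assumption).
freqs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
spectrum = [1000.0 / (1.0 + (40.0 * f) ** 4) for f in freqs]
gains = [wiener_gain(p, noise_var=1.0) for p in spectrum]
```

For this model the gain starts close to one at zero frequency and falls monotonically to nearly zero at the highest frequency, in line with the description above.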

A realizable approximation to the filter (3.68) can be obtained by frequency-sampling
design, where the filter frequency response $H(f_1,f_2)$ is sampled in the
frequency domain using $N_1 \times N_2$ samples. This filter can be efficiently implemented
using $N_1 \times N_2$ fast Fourier transform (FFT). The frequency-sampling design is
equivalent to approximating the impulse response $h(n_1,n_2)$ of the IIR filter with an
$N_1 \times N_2$ FIR filter with the impulse response $\tilde{h}(n_1,n_2)$, given by


$$\tilde{h}(n_1,n_2) = \sum_{i_1=-\infty}^{\infty} \sum_{i_2=-\infty}^{\infty} h(n_1 + i_1 N_1, n_2 + i_2 N_2) \tag{3.69}$$

Note that the frequency-sampling design method suffers from spatial-domain
aliasing. In practice, this may be negligible, provided that $N_1$ and $N_2$ are reasonably
large, e.g., $N_1 = N_2 = 512$ or larger.


3.5.3 Local Adaptive Filtering 


Linear shift-invariant (LSI) filters limit our ability to separate genuine image detail 
from noise, because they are based on wide-sense stationary (homogeneous) image 
models. Local image models offer rich possibilities for adaptive image processing 
that overcomes limitations of LSI filters. Geman and Geman [Gem 84] model local 
interactions between pixels using Markov random field (MRF) models and use a 
nonlinear Bayesian framework for denoising and image restoration. MRF models 
and optimization methods are reviewed in Appendices C and D, respectively. In this 
section, we discuss two simple (non-iterative) filters: an adaptive LMMSE filter and 
a directional filter. 
FIR-LMMSE Filter


Alternatively, we can pose the denoising problem as an optimal linear shift-invariant
FIR filter design problem. Assuming the observed image and the estimate are $N \times N$
arrays, the FIR-LMMSE filter can be expressed in vector-matrix form as


$$\hat{\mathbf{s}} = \mathbf{H}\mathbf{y} \tag{3.70}$$


where $\hat{\mathbf{s}}$ and $\mathbf{y}$ are $N^2 \times 1$ vectors formed by lexicographic ordering of the estimated
and observed image pixels, and $\mathbf{H}$ is an $N^2 \times N^2$ matrix operator formed by
coefficients of the FIR filter impulse response.

The principle of orthogonality can be stated in vector-matrix form as


$$E\{(\mathbf{s} - \hat{\mathbf{s}})\,\mathbf{y}^T\} = \mathbf{0} \quad \text{(zero matrix)} \tag{3.71a}$$


which states that every element of $\mathbf{s} - \hat{\mathbf{s}}$ is uncorrelated with every element of $\mathbf{y}$.
Substituting (3.70) into (3.71a), we obtain


$$E\{(\mathbf{s} - \mathbf{H}\mathbf{y})\,\mathbf{y}^T\} = \mathbf{0}$$

which can be simplified as

$$E\{\mathbf{s}\mathbf{y}^T\} = \mathbf{H}\,E\{\mathbf{y}\mathbf{y}^T\} \tag{3.71b}$$


Then the FIR-LMMSE filter operator $\mathbf{H}$ can be obtained as


$$\mathbf{H} = \mathbf{R}_{sy}\,\mathbf{R}_{yy}^{-1} \tag{3.71c}$$


where $\mathbf{R}_{yy}$ is the auto-correlation matrix of the observed image and $\mathbf{R}_{sy}$ is the
cross-correlation matrix of the ideal and observed images.

Given the observation model, we can easily show, as in the derivation of the IIR
filter, that $\mathbf{R}_{sy} = \mathbf{R}_{ss}$ and $\mathbf{R}_{yy} = \mathbf{R}_{ss} + \mathbf{R}_{vv}$, where $\mathbf{R}_{ss}$ and $\mathbf{R}_{vv}$ are the auto-correlation
matrices of the ideal image and the observation noise, respectively. Then the filter
operator becomes


$$\mathbf{H} = \mathbf{R}_{ss}\,[\mathbf{R}_{ss} + \mathbf{R}_{vv}]^{-1} \tag{3.72}$$


Observe that the implementation of the filter (3.72) requires inversion of an $N^2 \times N^2$
matrix. For a typical digital image, e.g., $N = 512$ or larger, this is a formidable task.
However, assuming that the image and noise are wide-sense stationary, i.e., they
have constant mean vectors (taken as zero without loss of generality) and spatially
invariant correlation matrices, the matrices $\mathbf{R}_{ss}$ and $\mathbf{R}_{vv}$ are block-Toeplitz. It is
common to approximate block-Toeplitz matrices by block-circulant ones, which can be
diagonalized through the 2D-DFT operation [Gon 07]. The reader can readily see
that the resulting frequency-domain FIR filter expression is identical to that obtained
by sampling the frequencies $(f_1, f_2)$ in Eqn. (3.68).


Adaptive LMMSE Filter 


As a compromise between the complexity of (3.72) and preserving important image 
detail, a simple space-varying image model, where the local image characteristics are 
captured in a space-varying mean, has been proposed [Lee 80, Kua 85]. The residual 
image after removing the local mean is modeled by a white Gaussian process with a 
space-varying variance. This section presents a spatially adaptive LMMSE estimator 
that is based on this model. The resulting filter is easy to implement, yet avoids blur- 
ring in the vicinity of edges and other detail. 
We define a residual image as


$$r_s(n_1,n_2) = s(n_1,n_2) - \mu_s(n_1,n_2) \tag{3.73}$$


where $\mu_s(n_1,n_2)$ is a spatially varying mean image. The residual image $r_s(n_1,n_2)$ is
modeled by a white process, i.e., its correlation matrix, given by


$$R_{rr}(n_1,n_2) = \sigma_s^2(n_1,n_2)\,\delta(n_1,n_2)$$


is diagonal. Note that the variance, $\sigma_s^2(n_1,n_2)$, of the residual also varies from pixel
to pixel. It follows that the residual of the observed image, defined as


$$r_y(n_1,n_2) = y(n_1,n_2) - \mu_y(n_1,n_2) \tag{3.74}$$


where $\mu_y(n_1,n_2)$ is the local mean of the observed image, is zero-mean and white,
because the noise is assumed to be zero-mean and white. Furthermore, from (3.57),


$$\mu_y(n_1,n_2) = \mu_s(n_1,n_2)$$

since the noise is zero-mean.


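Applying the FIR-LMMSE filter (3.70) to the residual observation vector,
$\mathbf{r}_y = \mathbf{y} - \boldsymbol{\mu}_y$, which is a zero-mean, wide-sense stationary image, we have

$$\hat{\mathbf{s}} - \boldsymbol{\mu}_s = \mathbf{H}\mathbf{r}_y = \mathbf{R}_{rr}\,[\mathbf{R}_{rr} + \mathbf{R}_{vv}]^{-1}\,(\mathbf{y} - \boldsymbol{\mu}_y) \tag{3.75a}$$

The matrix $\mathbf{R}_{rr}$ is diagonal because $r_s(n_1,n_2)$ is white. Then, the above
vector-matrix expression simplifies to the scalar form


$$\hat{s}(n_1,n_2) = \mu_s(n_1,n_2) + \frac{\sigma_s^2(n_1,n_2)}{\sigma_s^2(n_1,n_2) + \sigma_v^2}\,[y(n_1,n_2) - \mu_s(n_1,n_2)] \tag{3.75b}$$


which has a predictor-corrector structure and $\sigma_s^2(n_1,n_2) / (\sigma_s^2(n_1,n_2) + \sigma_v^2)$ is called
the filter gain.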

The adaptive LMMSE filter (3.75) requires estimation of the local mean $\mu_s(n_1,n_2)$
and the local variance $\sigma_s^2(n_1,n_2)$ at each pixel, which can be computed over an $M \times M$
local window $W$ centered at the pixel $(n_1,n_2)$ as


$$\hat{\mu}_s(n_1,n_2) = \hat{\mu}_y(n_1,n_2) = \frac{1}{M^2} \sum_{(i_1,i_2) \in W} y(i_1,i_2) \tag{3.76}$$


since the noise has zero-mean, and


$$\hat{\sigma}_y^2(n_1,n_2) = \frac{1}{M^2} \sum_{(i_1,i_2) \in W} [y(i_1,i_2) - \hat{\mu}_y(n_1,n_2)]^2 \tag{3.77}$$

In order to avoid a negative variance estimate, we have

$$\hat{\sigma}_s^2(n_1,n_2) = \max\{\hat{\sigma}_y^2(n_1,n_2) - \sigma_v^2, 0\} \tag{3.78}$$


Note that when $\hat{\sigma}_s^2$ is small, indicating a uniform image region, the filter gain is
negligible, and the adaptive LMMSE filter approaches a direct averaging filter. On
the other hand, when $\hat{\sigma}_s^2$ is large compared to $\sigma_v^2$, which indicates the presence of
edges or high-contrast texture, the filter gain approaches one, and edges/texture are 
preserved by effectively turning the filter off. Consequently, some noise is left around 
the edges, which may not be visible due to the masking effect of human vision. How- 
ever, this may be visually disturbing in low SNR cases. 
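
The predictor-corrector form of the adaptive LMMSE filter translates almost directly into code. The sketch below is an illustration under stated assumptions: the image is a list of rows, borders are handled by replication, the noise variance is known and positive, and the function names are hypothetical.

```python
def local_window(image, r, c, radius):
    """Pixels of the (2*radius+1) x (2*radius+1) window, borders replicated."""
    rows, cols = len(image), len(image[0])
    return [image[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)]


def adaptive_lmmse(image, noise_var, radius=1):
    """Local LMMSE filter; noise_var is assumed known and positive."""
    if noise_var <= 0:
        raise ValueError("noise_var must be positive")
    out = []
    for r in range(len(image)):
        row = []
        for c in range(len(image[0])):
            w = local_window(image, r, c, radius)
            mu = sum(w) / len(w)                            # local mean
            var_y = sum((v - mu) ** 2 for v in w) / len(w)  # local variance
            var_s = max(var_y - noise_var, 0.0)             # clipped at zero
            gain = var_s / (var_s + noise_var)              # filter gain
            row.append(mu + gain * (image[r][c] - mu))      # predictor-corrector
        out.append(row)
    return out


flat = [[10.0, 12.0, 10.0], [12.0, 10.0, 12.0], [10.0, 12.0, 10.0]]
smoothed = adaptive_lmmse(flat, noise_var=4.0)

edge = [[0.0, 0.0, 100.0, 100.0] for _ in range(3)]
preserved = adaptive_lmmse(edge, noise_var=1.0)
```

In the flat patch the local variance is below the noise variance, so the gain is zero and the center pixel becomes the local mean; next to the step edge the gain is close to one and the pixel is left almost unchanged.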


Directional Filtering 


An alternative approach for edge-preserving filtering is directional filtering, where we 
filter along the edges, but not across them. The directional filtering approach may be 
superior to adaptive LMMSE filtering in low SNR cases, since noise around edges 
can effectively be eliminated by filtering along the edges, as opposed to turning the 
filter off in the neighborhood of edges. 


Figure 3.24 Directional filtering kernels.


In directional filtering, possible edge orientations are typically quantized into
four angles, 0°, 45°, 90°, and 135°, and five FIR filter kernels, one for each
orientation and one for non-edge regions, are defined. The supports of the edge-oriented
FIR filters are depicted in Figure 3.24. In general, there exist two approaches for
directional filtering: i) select the most uniform support out of the five at each pixel
according to a criterion of uniformity or by edge detection [Dav 75], or ii) apply an
edge-adaptive filter within each kernel at each pixel and cascade the results [Cha 85].

Method I: Kernel selection. If the variance of the pixels within the non-edge
kernel is more than a pre-determined threshold, we decide that an edge is
present at that pixel. Then, the edge kernel with the lowest variance
(provided that it is lower than that of the non-edge kernel) is selected as the most likely edge
orientation at that pixel. Filtering is performed by averaging pixels indicated by
black dots in the respective kernel or by edge-oriented Gaussian filters, called
anisotropic diffusion filters. Filtering along the edge direction avoids spatial
blurring to a large extent.

Method II: Cascade. We use a spatially adaptive filter, such as the local LMMSE
filter, within each of the five supports at each pixel. Recall that the local
LMMSE filter is effectively off within those supports with a high variance,
and approaches direct averaging as the signal variance goes to zero. Thus, when
cascading five filters as $T = T_1 T_2 T_3 T_4 T_5$, where $T_i$ is the local LMMSE filter
applied over the respective kernel [Cha 85], effective filtering is performed
only over those kernels with a small variance. This method avoids the
support-selection problem.

Directional filtering, using either method, usually offers satisfactory noise
reduction around edges, since at least one of the filters shall be active at every pixel.
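
Method I (kernel selection) can be sketched as follows. The kernel supports below are assumptions, since the supports of Figure 3.24 are not reproduced here: each edge kernel is taken to be a 5-pixel line through the center, and the non-edge kernel a $3 \times 3$ block. Falling back to the non-edge kernel when no edge kernel has a lower variance is also an assumption, as are the function names.

```python
# Hypothetical supports as (row, column) offsets; Figure 3.24 may differ.
EDGE_KERNELS = {
    0: [(0, d) for d in range(-2, 3)],     # horizontal line
    45: [(-d, d) for d in range(-2, 3)],   # anti-diagonal line
    90: [(d, 0) for d in range(-2, 3)],    # vertical line
    135: [(d, d) for d in range(-2, 3)],   # main-diagonal line
}
NON_EDGE_KERNEL = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def gather(image, r, c, offsets):
    """Collect pixels at the given offsets, replicating the borders."""
    rows, cols = len(image), len(image[0])
    return [image[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
            for dr, dc in offsets]


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def directional_filter_pixel(image, r, c, threshold):
    flat = gather(image, r, c, NON_EDGE_KERNEL)
    flat_var = variance(flat)
    if flat_var <= threshold:              # no edge detected: plain average
        return sum(flat) / len(flat)
    candidates = {a: gather(image, r, c, k) for a, k in EDGE_KERNELS.items()}
    best = min(candidates, key=lambda a: variance(candidates[a]))
    if variance(candidates[best]) < flat_var:
        chosen = candidates[best]          # average along the likely edge
    else:
        chosen = flat                      # fallback (assumption)
    return sum(chosen) / len(chosen)


edge_image = [[0.0] * 3 + [100.0] * 3 for _ in range(6)]
filtered = directional_filter_pixel(edge_image, 2, 2, threshold=1.0)
box_mean = sum(gather(edge_image, 2, 2, NON_EDGE_KERNEL)) / 9
```

For a pixel just left of a vertical step edge, the vertical kernel has zero variance and is selected, so the output stays at 0, whereas the plain $3 \times 3$ average would smear the edge to about 33.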
3.5.4 Nonlinear Filtering: Order-Statistics, Wavelet
Shrinkage, and Bi-Lateral Filtering 


Nonlinear filtering refers to a very large family of image-processing operations; 
namely, any operation that is not linear. Nonlinear filters can be classified as pixel- 
wise or block-wise filters. Here, we focus on only a few popular classes of non- 
linear denoising techniques, which are pixel-wise median/order-statistics filters and 
bi-lateral filters, and block-wise wavelet shrinkage filters. 


Median Filtering 

Median filtering is a nonlinear operation that is implicitly edge adaptive. The output 
of the median filter is given by the median of pixels within the support of the filter, 
expressed as 


$$\hat{s}(n_1,n_2) = \mathrm{Med}\{y(i_1,i_2)\} \quad \text{for } (i_1,i_2) \in B_{(n_1,n_2)} \tag{3.79}$$


where $B_{(n_1,n_2)}$ denotes the filter support, e.g., a local neighborhood of pixel $(n_1,n_2)$,
and "Med" denotes the median operation. For example, in $3 \times 3$ median filtering,
there are nine pixels in the $3 \times 3$ neighborhood of a pixel, and the output for the
center pixel is given by the intensity of the fifth ranked (by intensity value) pixel. 
Larger filters may be needed to remove large clusters of impulses. The median filter 
is edge-preserving since it rejects outliers, avoiding blurring across edges [Ata 80, Arc 
91, Yin 96]. Fast algorithms for 2D (separable) median filtering exist [Ata 80]. 


Example: Median Filtering 


We demonstrate how median filtering preserves edges by means of a simple
example of 1D filtering. The original signal, in the form of two step edges, is
randomly contaminated by impulse noise at six samples. We
observe that the 3 × 1 mean filter spreads the noise to neighboring samples,
while the 3 × 1 median filter effectively removes all noise in this example.


Original signal   10 10 10 10 10 10 40 40 40 40 40 80 80 80 80 80 80
Noisy signal      10 10 10 60 10 10 40 80 40 10 80 40 80 80 10 80 80
Mean filtered        10 27 27 27 20 43 53 43 43 43 67 67 57 57 57
Median filtered      10 10 10 10 10 40 40 40 40 40 80 80 80 80 80
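
The rows of this example can be reproduced with a few lines of code. The sketch below assumes that only full 3-sample windows are evaluated (so the two border samples produce no output) and that the mean values are rounded to the nearest integer; the helper name `sliding` is hypothetical.

```python
from statistics import median

noisy = [10, 10, 10, 60, 10, 10, 40, 80, 40, 10, 80, 40, 80, 80, 10, 80, 80]


def sliding(signal, func, width=3):
    """Apply func to every full window of the given width (no border handling)."""
    return [func(signal[k:k + width]) for k in range(len(signal) - width + 1)]


mean_filtered = [round(v) for v in sliding(noisy, lambda w: sum(w) / len(w))]
median_filtered = sliding(noisy, median)
```

The mean filter output still carries traces of every impulse, while the median filter output equals the interior of the original two-step signal.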
Weighted Median Filtering


In median filtering each sample in the filter support is given an equal emphasis.
However, in some cases, median operation results in distortion around corners. The
weighted median filter is an extension of the median filter where each sample $(i_1,i_2)$
is assigned a weight $w_{(i_1,i_2)}$. The weighting is achieved by replicating the $(i_1,i_2)$th
sample $w_{(i_1,i_2)}$ times. Then, the output of the weighted median filter is given by


$$\hat{s}(n_1,n_2) = \mathrm{Med}\{w_{(i_1,i_2)} \diamond y(i_1,i_2)\} \quad \text{for } (i_1,i_2) \in B_{(n_1,n_2)} \tag{3.80}$$


where $\diamond$ is the replication operator. The properties of the filter vary depending on
how weights are assigned. The reader is referred to [Yin 96] for further details.

We can compare the properties of the median filter with those of the mean filter: 


1. The median of an odd number of samples $N_B$ is the sample with the smallest
sum of absolute differences with other samples in a given set of samples. The
sample mean is an estimate with the smallest square distance with all samples.
Thus, sample median and sample mean provide an estimate $\beta$ that minimizes
the criterion $D(\beta) = \sum_i |y_i - \beta|^p$ for $p = 1$ and $p = 2$, respectively.

2. The sample mean is the maximum likelihood (ML) estimate in the presence 
of Gaussian noise, whereas the sample median is the ML estimate in the pres- 
ence of Laplacian noise, which has heavier tails than Gaussian, i.e., impulsive 
noise. 
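
Both the replication-based weighted median and the first property above can be checked with a short sketch. Integer weights with an odd total are assumed so that the median is one of the samples; the function names and sample values are illustrative.

```python
from statistics import median


def weighted_median(samples, weights):
    """Median after replicating each sample 'weight' times (integer weights)."""
    replicated = []
    for value, weight in zip(samples, weights):
        replicated.extend([value] * weight)
    return median(replicated)


def lp_cost(samples, estimate, p):
    """Sum over samples of |y_i - estimate|^p."""
    return sum(abs(y - estimate) ** p for y in samples)


window = [10, 10, 60]
med = median(window)             # 10
avg = sum(window) / len(window)  # about 26.7

plain = weighted_median([10, 80, 10], [1, 1, 1])
center_weighted = weighted_median([10, 80, 10], [1, 3, 1])
```

With equal weights the isolated center value 80 is rejected, whereas a center weight of 3 lets it pass; this is how weights can be used to protect fine structures such as corners. The costs confirm that the median gives the smaller sum of absolute differences and the mean the smaller sum of squared differences.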


Order-Statistics Filters 


Order-statistics filters require rank ordering (sorting) pixel values in a neighborhood
of the current pixel. An alpha-trimmed mean filter is an order-statistics filter that
combines rank ordering and averaging operations to compute the output. In
particular, the $L$ smallest and $L$ largest pixel values in a local neighborhood are eliminated, and
the average of the remaining middle-ranked samples is computed as


$$\hat{s}(n_1,n_2) = \frac{1}{N_B - 2L} \sum_{i=L+1}^{N_B - L} y_{(i)} \tag{3.81}$$


where $y_{(i)}$ denotes the $i$th ranked sample among the $N_B$ samples in the neighborhood.
The alpha-trimmed mean filter becomes a mean filter for $L = 0$, and it approaches
the median filter as $L \rightarrow N_B/2$. Therefore, it can be effective in suppressing a
combination of Gaussian noise and impulsive salt-and-pepper noise by proper selection
of $L$. The median and order statistics filters can be made adaptive by adjusting the
size of the local neighborhood to match local image and noise statistics. The reader is
referred to [Arc 91] for details about multi-stage order statistic filters.
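
A minimal sketch of the alpha-trimmed mean for a single window is given below. The window values are made up for illustration, and the function name and the `trim` parameter (playing the role of $L$) are hypothetical.

```python
def alpha_trimmed_mean(window, trim):
    """Drop the 'trim' smallest and largest samples, then average the rest."""
    if 2 * trim >= len(window):
        raise ValueError("trim removes every sample")
    ranked = sorted(window)
    kept = ranked[trim:len(ranked) - trim]
    return sum(kept) / len(kept)


# A 3 x 3 neighborhood (flattened) with one "salt" and one "pepper" impulse.
window = [10, 12, 11, 255, 0, 9, 10, 11, 12]
as_mean = alpha_trimmed_mean(window, 0)     # plain mean, pulled up by 255
trimmed = alpha_trimmed_mean(window, 1)     # impulses removed, rest averaged
as_median = alpha_trimmed_mean(window, 4)   # only the middle sample remains
```

Trimming a single sample from each end already removes both impulses here while still averaging the remaining seven samples.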


Wavelet Shrinkage — Denoising Using Sparse Representations 


Natural images can be represented by a sparse vector in some transform domain, 
e.g., an orthogonal wavelet transform domain. Under sparse image modeling, addi- 
tive noise leads to very low SNR on many transform coefficients with small magni- 
tudes. A thresholding can simply detect and remove these coefficients resulting in 
robust and high-performance denoising. Denoising by wavelet shrinkage, proposed 
by Donoho [Don 95a], refers to hard or soft thresholding of orthogonal wavelet 
transform coefficients, where it is assumed that larger coefficients represent “signal” 
and small coefficients are “noise.” All wavelet shrinkage methods consist of the fol- 
lowing three main steps: i) linear forward wavelet transform, ii) nonlinear shrinkage, 
and iii) inverse wavelet transform. Many different wavelet shrinkage methods vary 
in the details of wavelet transform implementation and choice of shrinkage function 
and threshold value. 

The selection of wavelet basis functions (analysis and synthesis filters), the num- 
ber of resolution levels, and image-boundary handling in filtering all affect denoising 
performance. The nearly symmetric orthogonal wavelets are generally preferred for 
denoising, since orthogonal transform of white noise is white in the wavelet domain, 
and orthogonal transforms preserve the mean-squared error. Typical choices are 3 
or 4 resolution levels using Symlet8 wavelets and periodic or symmetric boundary 
extension [Fod 03]. 

The shrinkage functions can be classified according to i) whether they use hard
or soft thresholding, where hard thresholding of wavelet coefficient $w$ is defined by


$$\delta_H(w) = \begin{cases} w & \text{if } |w| > \lambda \\ 0 & \text{otherwise} \end{cases} \tag{3.82}$$


and soft thresholding is defined by


$$\delta_S(w) = \begin{cases} \mathrm{sgn}(w)\,(|w| - \lambda) & \text{if } |w| > \lambda \\ 0 & \text{otherwise} \end{cases} \tag{3.83}$$


and ii) whether they use a universal threshold or adaptive thresholds for different
resolution levels or subbands. They also vary according to the criterion used for
determination of the threshold value $\lambda$. Several different criteria exist to estimate the
best threshold, such as Stein's unbiased risk estimate (SURE), minimax, and
Bayesian criteria [Fod 03, Lui 07].
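
The two shrinkage rules can be written as scalar functions. The sketch below covers only the nonlinear shrinkage step; the forward and inverse wavelet transforms are not shown, and the function names and the parameter name `lam` for the threshold are illustrative.

```python
import math


def hard_threshold(w, lam):
    """Keep the coefficient if its magnitude exceeds lam, else set it to zero."""
    return w if abs(w) > lam else 0.0


def soft_threshold(w, lam):
    """Zero small coefficients and shrink the others toward zero by lam."""
    if abs(w) <= lam:
        return 0.0
    return math.copysign(abs(w) - lam, w)


coeffs = [5.0, -5.0, 1.0, -0.5]
hard = [hard_threshold(w, 2.0) for w in coeffs]
soft = [soft_threshold(w, 2.0) for w in coeffs]
```

Hard thresholding leaves the surviving coefficients untouched, while soft thresholding also reduces their magnitude by the threshold.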

VisuShrink uses a universal threshold $\lambda = \sigma_v \sqrt{2 \log N}$, where $\sigma_v^2$ is the variance
of noise and $N$ is the number of wavelet coefficients. SureShrink selects an adaptive
threshold for each sub-band to optimize the SURE criterion [Don 95b]. It combines
universal threshold selection and scale-dependent adaptive threshold selection
according to sparsity of subbands. The soft thresholding function is continuous;
however, its first derivative is not continuous. Hence, Donoho's method searches for
the optimal threshold within a finite set. SureShrink can be summarized as follows: Assuming
the wavelet coefficients are normalized by an estimate of $\sigma_v$, let $\mathbf{w}_j$ denote the vector
formed by the normalized wavelet coefficients $w_n/\sigma_v$, $n = 1, \ldots, N_j$, in sub-band $j$,
and let $N_j$ be the number of wavelet coefficients in sub-band $j$. For each sub-band $j$, use
a fixed universal threshold, given by

$$\lambda_j = \sqrt{2 \log N_j} \quad \text{if} \quad \frac{1}{N_j} \sum_{n=1}^{N_j} \left[ \left( \frac{w_n}{\sigma_v} \right)^2 - 1 \right] \le \frac{(\log_2 N_j)^{3/2}}{\sqrt{N_j}}$$

i.e., if only few wavelet coefficients are non-zero. Otherwise, use the SURE threshold,
given by

$$\lambda_j = \arg\min_{\lambda \ge 0} \mathrm{SURE}(\lambda, \mathbf{w}_j) \tag{3.84}$$

where

$$\mathrm{SURE}(\lambda, \mathbf{w}_j) = N_j - 2 M_j + \sum_{n=1}^{N_j} \left[ \min\left( \left| \frac{w_n}{\sigma_v} \right|, \lambda \right) \right]^2$$

and $M_j$ is the number of coefficients in sub-band $j$ whose absolute value is less than
$\lambda$. Then, the denoised coefficients are given by $\hat{w}_n = \sigma_v\, \delta_S(w_n/\sigma_v)$ with threshold $\lambda_j$.

BayesShrink is another scale-adaptive threshold estimation method that minimizes
the Bayes risk [Cha 00]. The threshold is given by $\lambda = \sigma_v^2 / \sigma_s$, where $\sigma_s^2$ is the
variance of the noise-free signal (for each subband), which must be estimated from
the noisy image.
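
The SureShrink threshold selection for one sub-band can be sketched in Python as follows. This is a sketch under stated assumptions rather than a reference implementation: the coefficients are taken to be already normalized by the noise standard deviation, the search set consists of zero and the coefficient magnitudes not exceeding the universal threshold, and coefficients equal to the threshold are counted as "small". The names `sure_risk` and `sureshrink_threshold` are hypothetical.

```python
import math


def sure_risk(lam, w):
    """SURE value for soft thresholding of unit-variance coefficients w."""
    # Assumption: coefficients with |x| == lam are counted as "small".
    n_small = sum(1 for x in w if abs(x) <= lam)
    return len(w) - 2 * n_small + sum(min(abs(x), lam) ** 2 for x in w)


def sureshrink_threshold(w):
    """Pick the universal or the SURE threshold for one sub-band."""
    n = len(w)
    universal = math.sqrt(2.0 * math.log(n))
    # Sparsity test: average excess energy against (log2 n)^(3/2) / sqrt(n).
    excess = sum(x * x - 1.0 for x in w) / n
    if excess <= math.log2(n) ** 1.5 / math.sqrt(n):
        return universal
    # Assumption: search a finite candidate set bounded by the universal value.
    candidates = [0.0] + [abs(x) for x in w if abs(x) <= universal]
    return min(candidates, key=lambda lam: sure_risk(lam, w))
```

For a sparse sub-band such as `[0.1] * 100` the function returns the universal threshold, while for a dense sub-band the minimizer of the SURE value is returned.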


Bi-Lateral Filter 


Bi-lateral filters perform combined domain and range filtering by some weighting
of geometric proximity and similarity in the intensity or the CIE-Lab color space as
introduced in Section 3.1.2. Bi-lateral filtering has been applied to image denoising
by proper choice of the kernel size $N \times N$, domain parameter $\sigma_d^2$, and range parameter
$\sigma_r^2$ to match the noise statistics. The larger the range parameter, the closer the
filter approaches Gaussian filtering. Hence, there is a tradeoff between the amount of
noise reduction and blurring of edges. In the presence of salt-and-pepper noise, a
median filter may be applied prior to bi-lateral filtering. In bi-lateral filtering of color
images, instead of the common practice of processing the luminance only, a perceptual
color similarity metric in the CIE-Lab color space can be employed to smooth
colors without color bleeding and blurring. Range filtering in the CIE-Lab color
space is a natural way of processing color images, where only perceptually similar
colors are averaged, and perceptually important edges are preserved.


3.5.5 Non-Local Filtering: NL-Means and BM3D 


Most image-denoising methods exploit correlations between pixel intensities within 
a local neighborhood of a pixel, i.e., they assume pixel intensities within nearby 
locations are similar to each other, while noise samples are uncorrelated. However, 
this assumption breaks down near edges or texture regions, where contrast and/or 
color of pixels change suddenly. As a result, modeling similarity by geometric prox- 
imity of pixel locations (in the domain of the image) causes blurring and/or color 
bleeding artifacts in image denoising. To overcome this problem, bi-lateral filtering 
combines domain and range filtering (see Section 3.5.4). Alternatively, non-local 
filtering methods exploit non-local self-similarities in an image, i.e., range (intensity) 
similarity over non-local image areas. We study non-local means filtering (a patch- 
based range filtering method), and BM3D (a patch-based processing method in the 
transform domain) in more detail in the following. 


Non-Local Means Filtering 


The non-local means (NLM) filter locates patches, defined as fixed-size small windows
that can be overlapping or non-overlapping, that are similar to a patch centered
at the current pixel, and denoises the current pixel by a weighted average of the
center pixels of these similar patches. The main differences between bi-lateral and
NLM filters are that the NLM filter replaces the single-pixel intensity similarity measure
in bi-lateral filtering with a fixed-size patch-based intensity similarity measure,
and ignores the geometric proximity measure to exploit non-local self-similarities.
Buades et al. [Bua 05] show that NLM filters are very effective for denoising.

We describe pixel-wise implementation of the NLM filter: Given a noisy image
$y(\mathbf{n})$, the filtered intensity value $\hat{s}(\mathbf{n})$ for each pixel $\mathbf{n}$ can be computed as a weighted
average of all similar pixels $\mathbf{k}$ in the image, given by

$$\hat{s}(\mathbf{n}) = \frac{1}{C(\mathbf{n})} \sum_{\mathbf{k} \in \mathcal{N}(\mathbf{n})} w(\mathbf{n}, \mathbf{k})\, y(\mathbf{k}) \tag{3.85}$$

where $C(\mathbf{n}) = \sum_{\mathbf{k}} w(\mathbf{n}, \mathbf{k})$ is a normalizing constant and $\mathcal{N}(\mathbf{n})$ denotes a $(2q+1) \times (2q+1)$
window centered at pixel $\mathbf{n}$ to search for similar patches. The weights $w(\mathbf{n}, \mathbf{k})$
depend on the sum of squared pixel-by-pixel differences between the
$(2r+1) \times (2r+1)$ patches $P_r$ centered at pixels $\mathbf{n}$ and $\mathbf{k}$, respectively, given by

$$d^2(\mathbf{n}, \mathbf{k}) = \frac{1}{(2r+1)^2} \sum_{\mathbf{i} \in P_r} \left[ y(\mathbf{n} - \mathbf{i}) - y(\mathbf{k} - \mathbf{i}) \right]^2 \tag{3.86}$$


The patch size is typically $5 \times 5$ ($r = 2$), except maybe in very noisy images. The
search window has a limited range for computational efficiency reasons. The size of
the search window typically varies between $21 \times 21$ ($q = 10$ for moderate noise) and
$35 \times 35$ ($q = 17$ for large noise). The larger the number of similar pixels, the better
the noise reduction. In order not to exclude the current pixel from its own estimate,
the distance between the window centered at the current pixel and itself, $d(\mathbf{n}, \mathbf{n})$, is
set equal to the minimum of all other distances. The intensity similarity between
two patches can also be weighted by a Gaussian kernel such that pixels closer to the
center have more weight in patch comparison.

The weights are designed to allow averaging center pixels of similar patches,
which differ up to noise level, with equal emphasis. If the noise samples are i.i.d. with
zero mean and variance $\sigma^2$, the expected squared distance between two otherwise identical
patches due to noise is $2\sigma^2$. That is, weights for patches with squared distances smaller
than $2\sigma^2$ are set to 1, while weights for patches with larger distances decrease rapidly
according to the exponential rule

$$w(\mathbf{n}, \mathbf{k}) = \exp\left( -\frac{\max\{ d^2(\mathbf{n}, \mathbf{k}) - 2\sigma^2,\, 0 \}}{h^2} \right) \tag{3.87}$$

where the parameter $h$ controls the decay of the weights as a function of intensity
dissimilarity. If no matching patches are found for a pixel to satisfy $d^2(\mathbf{n}, \mathbf{k}) < 2\sigma^2$, the
filter may not alter the value of the pixel, leaving the noise unprocessed.
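
The pixel-wise NLM procedure can be sketched with NumPy as below. This is a slow reference loop meant only to mirror the description, under a few assumptions that are not in the text: image borders are handled by reflection padding, the search window is clipped at the image boundary, and the function name `nlm_denoise` and its argument names are hypothetical. The image must contain at least two pixels.

```python
import numpy as np


def nlm_denoise(y, sigma, h, r=2, q=10):
    """Pixel-wise NL-means for a 2D grayscale array (slow reference loop)."""
    y = np.asarray(y, dtype=float)
    rows, cols = y.shape
    padded = np.pad(y, r, mode="reflect")  # assumption: reflect at the borders
    out = np.empty_like(y)
    for n1 in range(rows):
        for n2 in range(cols):
            ref = padded[n1:n1 + 2 * r + 1, n2:n2 + 2 * r + 1]
            dists, values = [], []
            # Search window of half-size q, clipped to the image support.
            for k1 in range(max(0, n1 - q), min(rows, n1 + q + 1)):
                for k2 in range(max(0, n2 - q), min(cols, n2 + q + 1)):
                    if (k1, k2) == (n1, n2):
                        continue
                    cand = padded[k1:k1 + 2 * r + 1, k2:k2 + 2 * r + 1]
                    # Mean squared patch difference, as in the distance d^2.
                    dists.append(np.mean((ref - cand) ** 2))
                    values.append(y[k1, k2])
            dists = np.array(dists)
            values = np.array(values)
            # The pixel's own distance is set to the minimum of the others.
            dists = np.append(dists, dists.min())
            values = np.append(values, y[n1, n2])
            w = np.exp(-np.maximum(dists - 2.0 * sigma ** 2, 0.0) / h ** 2)
            out[n1, n2] = np.sum(w * values) / np.sum(w)
    return out
```

A call such as `nlm_denoise(noisy, sigma=1.0, h=1.0, r=1, q=3)` runs quickly on small test images; practical implementations vectorize the patch comparisons.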


Block-Matching 3D (BM3D) Filtering 


Katkovnik et al. [Kat 10] classify noise filters as local/non-local and pixel-wise vs. 
block-wise (multi-point) filters. While the NL-means filter is an example of non- 
local, pixel-wise filtering, the BM3D is a non-local, block-wise filtering method, 
where possibly overlapping similar 2D-image blocks are grouped into 3D arrays.
BM3D applies collaborative filtering, which is a 3D-transform domain filtering, to
such 3D arrays (groups). It is a block-wise filter, as opposed to a pixel-wise filter,
since it outputs a group of processed 2D-image blocks, which are further aggregated
to form the final image estimate. More specifically, BM3D consists of three main
steps:


1. Formation of groups: Given the current block, similar blocks are selected
and stacked together as a 3D array. If all blocks contain the same structure 
except for the noise, then averaging pixels in the same location over all blocks 
would be the optimal estimator. In the more realistic case, there are some minor 
structural differences between the blocks that require collaborative filtering in 
the transform domain. 

2. Collaborative filtering: The filter consists of 3D transformation of 3D groups,
shrinkage of transform coefficients, and inverse 3D transformation. Similarity 
within the 3D groups of blocks implies that the resulting transform will be 
sparse. Shrinkage of transform coefficients attenuates noise while keeping the 
fine details shared by the group of blocks. Furthermore, significant improve- 
ments can be obtained by collaborative Wiener filtering [Dab 07]. The result is 
a 3D estimate that consists of a jointly filtered group of image blocks. 

3. Aggregation: The filtered 2D blocks are returned to their original positions. 
Because these blocks overlap, we obtain multiple estimates for each pixel, which 
are combined by a weighted averaging procedure to form the final denoised 
estimate for each pixel. 


BM3D has been shown to achieve state-of-the-art denoising performance in 
terms of both peak signal-to-noise ratio and subjective visual quality [Dab 07]. 
Extensions of BM3D to shape adaptive transform domain filtering and PCA on 
these adaptive-shape neighborhoods exist [Kat 10]. 


3.6 Image Restoration 


Image restoration refers to deblurring of images degraded by optical smearing, 
including linear motion, out-of-focus imaging, or camera shake. Optical blurring 
arises when a single point object spreads over several image pixels, which may be 
due to relative motion between the object and camera or out-of-focus imaging. 
The extent of the spatial spread is determined by the point-spread function (PSF), 
which is the impulse response of the imaging system. In the discrete spatio-temporal 
domain, the general restoration problem can be formulated as solving a (possibly
underdetermined) set of simultaneous linear equations, which simplifies to a deconvolution
problem in the case of spatially invariant blurs.


3.6.1 Blur Models 


This section introduces the concept of a point spread function and discusses shift- 
invariant and shift-varying modeling of spatial blurring by the convolution and 
superposition summation, respectively. A vector-matrix model is also presented. 
Blurring in the temporal direction is often negligible, and will not be considered. 


Point-Spread Function 


Modeling a point source (object) by a 2D impulse, the PSF defines the impulse 
response of an imaging system. We present PSF models for the most common 
sources of blur: out-of-focus blur, motion blur, and camera shake. 


Out-of-Focus Blur 


An ideal camera would form a point image at the focal plane of the camera for a 
point source. However, if the camera is out-of-focus, i.e., if the image plane (sen- 
sor) is moved away from the focal plane, then we observe a circle (with finite radius) 
image for a point source as depicted in Figure 3.25. This circle is commonly known 
as the circle of confusion. 

Thus, we model the PSF of an out-of-focus imaging system with a uniform circle,
called the circle of confusion, given by

$$h(x_1, x_2) = \begin{cases} \dfrac{1}{\pi r^2} & \text{if } x_1^2 + x_2^2 \le r^2 \\ 0 & \text{otherwise} \end{cases} \tag{3.88}$$

where the radius $r$ of the circle, depicted in Figure 3.26, indicates the amount of
out-of-focus. The discrete PSF $h(n_1, n_2)$ is obtained by sampling of $h(x_1, x_2)$ by means of
integrating the area under the continuous PSF within each pixel element.

Figure 3.25 Illustration of out-of-focus blur formation and circle of confusion.

Figure 3.26 PSF of out-of-focus blur.

The frequency response of the out-of-focus blur is given by the Fourier transform
of $h(x_1, x_2)$ as

$$H(u_1, u_2) = \frac{2}{r \sqrt{u_1^2 + u_2^2}}\, J_1\!\left( r \sqrt{u_1^2 + u_2^2} \right) \tag{3.89}$$

where $J_1(\cdot)$ stands for the Bessel function of the first kind and order one, which has
regular zero-crossings. Note that $H(u_1, u_2)$ is circularly symmetric.
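
The area-integration step used to obtain a discrete circle-of-confusion PSF can be approximated numerically. The sketch below assumes NumPy and replaces the exact area integral with sub-pixel sampling; the function name `defocus_psf` and the `oversample` parameter are hypothetical, and the radius is assumed to be at least comparable to the sub-sample spacing.

```python
import numpy as np


def defocus_psf(radius, oversample=8):
    """Approximate the discrete out-of-focus PSF by sub-pixel sampling."""
    half = int(np.ceil(radius))
    size = 2 * half + 1  # odd support so the PSF is centered on a pixel
    n_fine = size * oversample
    # Sub-sample positions; the center pixel covers [-0.5, 0.5].
    coords = (np.arange(n_fine) + 0.5) / oversample - size / 2.0
    x1, x2 = np.meshgrid(coords, coords, indexing="ij")
    inside = (x1 ** 2 + x2 ** 2 <= radius ** 2).astype(float)
    # Averaging the sub-samples of each pixel approximates the area integral.
    psf = inside.reshape(size, oversample, size, oversample).mean(axis=(1, 3))
    return psf / psf.sum()  # enforce unit sum (no energy gained or lost)
```

Pixels fully inside the circle receive equal weight, pixels fully outside receive zero, and pixels on the boundary receive a weight proportional to the covered area.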


Linear-Motion Blur 


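Suppose there is relative translation between the camera and the scene at a constant
velocity $v$ along a direction that makes an angle $\theta$ with the horizontal axis of the image
plane during the exposure interval $[0, \tau]$. Let us define the extent of the motion as
$\Delta = v\tau$. Then, the PSF of linear motion blur is given by

$$h(x_1, x_2) = \begin{cases} \dfrac{1}{\Delta} & \text{if } \sqrt{x_1^2 + x_2^2} \le \dfrac{\Delta}{2} \text{ and } x_2 = x_1 \tan\theta \\ 0 & \text{otherwise} \end{cases} \tag{3.90a}$$

In the special case, where the motion is parallel to the horizontal axis in the
image plane, we have a 1D PSF given by

$$h(x_1, x_2) = \begin{cases} \dfrac{1}{\Delta} & \text{if } -\dfrac{\Delta}{2} \le x_1 \le \dfrac{\Delta}{2} \text{ and } x_2 = 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.90b}$$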


The frequency response of the 1D horizontal uniform motion blur can be
expressed as

$$H(u_1, u_2) = \frac{\sin\left( u_1 \Delta / 2 \right)}{u_1 \Delta / 2} \tag{3.91}$$

which has periodic zero-crossings. Discretization of the PSF is again accomplished
by integrating the area under $h(x_1, x_2)$ over each pixel element. Motion blur becomes
visible in scenes with fast action, such as in freeze frames of sports video.

Camera shake is another common source of image blur that can be modeled by
uniform motion provided that camera rotation is negligible [Fer 06].


Space-Invariant Spatial Blurring 


If we have a linear space-invariant (LSI) blur, the blurred image can be modeled as
the output of a linear filter whose impulse response is not a function of position
within the image. This means the image is blurred exactly the same way at each pixel
position, i.e., there is no significant parallax and image-plane rotation of the camera
is small. A spatially invariant blur can be modeled by 2D convolution of an image
with the spatial PSF $h(n_1, n_2)$, given by

$$y(n_1, n_2) = h(n_1, n_2) ** s(n_1, n_2) + v(n_1, n_2) \tag{3.92a}$$

where $y(n_1, n_2)$, $s(n_1, n_2)$, and $v(n_1, n_2)$ denote the degraded image, ideal image, and
noise, respectively. We assume that

1. $h(n_1, n_2)$ is real and non-negative for all $(n_1, n_2)$ due to physics of image formation, and

2. $\sum_{n_1} \sum_{n_2} h(n_1, n_2) = 1$; i.e., no energy is gained or lost due to image blurring.

The model (3.92a) can be expressed in vector-matrix form as

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{v} \tag{3.92b}$$

where $\mathbf{y}$, $\mathbf{s}$, and $\mathbf{v}$ denote the observed image, the original image, and the noise
that are lexicographically ordered into $N^2 \times 1$ vectors (for an $N \times N$ frame),
respectively, and the matrix $\mathbf{H}$ characterizes the blur PSF in operator form.
Note that for a spatially shift-invariant blur, $\mathbf{H}$ is a block-Toeplitz matrix.


Shift-Varying Spatial Blurring

Image formation in the presence of a linear shift-varying (LSV) blur can be modeled
as the output of a 2D-LSV system, denoted by $\mathcal{L}$. We define the space-varying PSF
of this imaging system as

$$h(n_1, n_2; i_1, i_2) = \mathcal{L}\{ \delta(n_1 - i_1, n_2 - i_2) \} \tag{3.93}$$

which represents the image of a point source located at $(i_1, i_2)$, and may have infinite
extent. It is well known that a discrete image can be represented as a weighted and
shifted sum of point sources as

$$s(n_1, n_2) = \sum_{(i_1, i_2) \in \mathcal{S}} s(i_1, i_2)\, \delta(n_1 - i_1, n_2 - i_2) \tag{3.94}$$

where $s(i_1, i_2)$ denotes the samples of the image and $\mathcal{S}$ denotes the image support.
Then, applying the system operator to both sides of (3.94), using (3.93), and including
the observation noise, $v(n_1, n_2)$, the blurred frame can be expressed as

$$y(n_1, n_2) = \sum_{(i_1, i_2) \in \mathcal{S}} s(i_1, i_2)\, h(n_1, n_2; i_1, i_2) + v(n_1, n_2) \tag{3.95}$$

For an $N_1 \times N_2$ observed blurred image, Eqn. (3.95) yields $N_1 \times N_2$ coupled
linear equations with $N_1 \times N_2$ unknowns $s(i_1, i_2)$, assuming that the PSF is known.
Hence, the restoration problem can be formulated as solving an inconsistent set of
$N_1 \times N_2$ simultaneous equations in the presence of noise. In the case of LSI blurs,
the system matrix is block-Toeplitz; thus, the $N_1 \times N_2$ equations can be decoupled,
under the block-circulant approximation, by applying the 2D discrete Fourier transform.
In the LSV case, however, the system matrix is not block-Toeplitz, and fast
methods in the transform domain are not applicable.

Based on these models, we can pose the following image restoration problems: 


1. Shift-invariant restoration: Given the image formation model (3.92) and the
shift-invariant PSF $h(n_1, n_2)$, find an estimate $\hat{s}(n_1, n_2)$ of the image.

2. Shift-varying restoration: Given the image formation model (3.95) and the
space-varying PSF $h(n_1, n_2; i_1, i_2)$, find an estimate $\hat{s}(n_1, n_2)$ of the image.


The problem is called blind-image restoration (see Section 3.6.3) if the PSF is
unknown. The intraframe image-restoration problem is extended to multi-frame
video restoration in Chapter 6, where it is formulated as: given a sequence of correlated
frames, find an estimate of all frames by processing them simultaneously.




3.6.2 Restoration of Images Degraded by Linear Space-Invariant Blurs


Image restoration is defined as the process of undoing image blurring (smearing) 
based on a mathematical model of its formation. Because the model (3.92) is gener- 
ally not invertible, restoration algorithms employ regularization techniques by using 
a priori information about the ideal image. Many methods exist for image restora- 
tion depending on the amount and nature of the a priori information used. The clas- 
sical techniques are pseudo-inverse filtering, Wiener and constrained least-squares 
(CLS) filtering, the constrained iterative method [Sch 81], the maximum a posteriori 
probability (MAP) estimation method [Tru 79], and the projection onto convex sets 
(POCS) method [Tru 84, Tru 85]. In this section, we first discuss pseudo-inverse 
filtering and CLS/Wiener filtering. The reader is referred to [Sez 90] for a review of 
classical image-restoration methods. We then introduce restoration methods based 
on sparse-image modeling, which have recently been proposed [Ela 10]. 


Pseudo-Inverse Filtering 


If we neglect the noise in the degradation model (3.92), i.e., if the vector $\mathbf{y}$ lies in the
column space of the matrix $\mathbf{H}$ and the matrix $\mathbf{H}$ is invertible, an exact solution can
be found by direct inversion:

$$\hat{\mathbf{s}} = \mathbf{H}^{-1} \mathbf{y} \tag{3.96}$$

where $\hat{\mathbf{s}}$ denotes an estimate of the ideal image. The operation (3.96) is known as
inverse filtering. However, because inverse filtering using (3.96) requires a huge
matrix inversion, it is usually implemented in the frequency domain.

Taking the 2D (spatial) Fourier transforms of both sides of (3.92), ignoring the
noise term, we have

$$Y(f_1, f_2) = H(f_1, f_2)\, S(f_1, f_2)$$

Then, inverse filtering can be expressed in the frequency domain as

$$\hat{S}(f_1, f_2) = \frac{Y(f_1, f_2)}{H(f_1, f_2)} \tag{3.97}$$


The implementation of (3.97) requires sampling of the frequency variables
$(f_1, f_2)$, which corresponds to computing the discrete Fourier transform (DFT) of
the respective signals. Note that sampling in the frequency domain results in
spatial-domain aliasing given by (3.69).

The relationship between (3.96) and (3.97) can also be shown by diagonalizing
the matrix $\mathbf{H}$. Because the $N^2 \times N^2$ matrix $\mathbf{H}$ is block-Toeplitz for spatially
shift-invariant blurs, it can be diagonalized under the block-circulant approximation as
$\mathbf{H} = \mathbf{W} \boldsymbol{\Lambda} \mathbf{W}^{-1}$, where $\mathbf{W}$ is the $N^2 \times N^2$ 2D-DFT matrix that contains $N^2$ partitions
of size $N \times N$. The $im$th partition of $\mathbf{W}$ is in the form
$\mathbf{W}_{im} = \exp\{ j \frac{2\pi}{N} i m \} \exp\{ j \frac{2\pi}{N} l n \}$, $l, n = 0, \ldots, N-1$,
and $\boldsymbol{\Lambda}$ is a diagonal matrix formed by the eigenvalues of $\mathbf{H}$, which
are given by the 2D-DFT of $h[n_1, n_2]$ [Gon 07]. Then, premultiplying both sides of
(3.96) by $\mathbf{W}^{-1}$, and inserting the term $\mathbf{W}\mathbf{W}^{-1}$ between $\mathbf{H}^{-1}$ and $\mathbf{y}$, we have

$$\mathbf{W}^{-1} \hat{\mathbf{s}} = (\mathbf{W}^{-1} \mathbf{H}^{-1} \mathbf{W})\, \mathbf{W}^{-1} \mathbf{y}$$

or

$$\hat{S}(k_1, k_2) = \frac{Y(k_1, k_2)}{H(k_1, k_2)}$$

where multiplication by $\mathbf{W}^{-1}$ computes the 2D-DFT, and $k_1$ and $k_2$ denote the
discretized spatial-frequency variables; hence, the result is equivalent to (3.97).

Inverse filtering has some drawbacks. First, the matrix $\mathbf{H}$ may be singular, which
means at least one of its eigenvalues $H(k_1, k_2)$ may be zero, resulting in a division by
zero in the 2D-DFT implementation (3.97). Even if the matrix $\mathbf{H}$ is non-singular, i.e.,
$H(k_1, k_2) \ne 0$ for all $(k_1, k_2)$, the vector $\mathbf{y}$ almost never lies in the column space of the
matrix $\mathbf{H}$ due to the presence of observation noise $\mathbf{v}$; thus, an exact solution for (3.96)
does not exist. Then, one must resort to a least-squares (LS) solution that minimizes the
norm of the residual $\mathbf{y} - \mathbf{H}\mathbf{s}$. The LS solution, known as pseudo-inverse filtering, given by

$$\hat{\mathbf{s}} = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \mathbf{y} \tag{3.98}$$

exists if the columns of the matrix $\mathbf{H}$ are linearly independent. The pseudo-inverse
filter can be implemented in the frequency domain by (3.97), where division by zero
is defined as zero.

Deconvolution by pseudo-inversion is ill-posed due to the presence of observa- 
tion noise. This is because the pseudo-inverse of the blur transfer function usually 
has very large magnitude near those frequencies where the blur transfer function has 
zeros and at high frequencies. This results in excessive noise amplification at those 
frequencies that can be alleviated by regularized inversion. 
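
The DFT-domain pseudo-inverse filter, with division by zero defined as zero, can be sketched with NumPy as follows. The sketch assumes the blur acts as a circular convolution (the block-circulant approximation) and treats DFT samples with magnitude below a small tolerance `eps` as zeros; the function name `pseudo_inverse_filter` and the tolerance are assumptions, not part of the text.

```python
import numpy as np


def pseudo_inverse_filter(y, h, eps=1e-8):
    """Restore y, blurred by circular convolution with the PSF h, via the DFT."""
    H = np.fft.fft2(h, s=y.shape)  # zero-pad the PSF to the image size
    Y = np.fft.fft2(y)
    S_hat = np.zeros_like(Y)
    nonzero = np.abs(H) > eps      # division by zero is defined as zero
    S_hat[nonzero] = Y[nonzero] / H[nonzero]
    return np.real(np.fft.ifft2(S_hat))
```

In the noise-free case with a PSF whose DFT has no zeros, this recovers the original image up to rounding error. With noise present, the small values of the blur transfer function amplify the noise, which motivates the regularized methods discussed next.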


Regularization by Constrained Least Squares — Wiener Deconvolution 


Regularized deconvolution methods use a priori information about the image to roll 
off the transfer function of the pseudo-inverse filter near singular frequencies and at
high frequencies in an attempt to limit noise amplification. However, the regularized
filter inevitably deviates from the exact inverse at those frequencies, which leads to
other kinds of artifacts, known as regularization artifacts [Tek 90]. Regularization
of the inversion can be achieved by deterministic optimization (constrained least
squares) or by statistical estimation (Wiener filtering or MAP estimation) methods.
The reader may refer to Appendix A for an overview of regularization methods.

The CLS method aims to minimize the $\ell_2$-norm of the high-frequency content
in the restored image, $\|\mathbf{L}\mathbf{s}\|_2^2$, which requires it to be as smooth as possible, while
ensuring the solution is consistent with the observations, i.e., satisfies the degradation
model. Hence, it is formulated as a constrained optimization problem given by

$$\min_{\mathbf{s}} \|\mathbf{L}\mathbf{s}\|_2^2 \quad \text{subject to} \quad \|\mathbf{y} - \mathbf{H}\mathbf{s}\|_2^2 = \sigma_v^2 \tag{3.99a}$$

where $\mathbf{L}$ is a regularization operator and $\sigma_v^2$ denotes the variance of the noise. The
operator $\mathbf{L}$ is typically chosen to have high-pass characteristics (e.g., the Laplacian
filter) so the CLS method finds the smoothest image (i.e., the image with the smallest
high-frequency content) which statistically complies with the observation model
(3.92). This problem can be converted into an unconstrained optimization problem
using the Lagrangian formulation (see Appendix A),

$$\min_{\mathbf{s}} E(\mathbf{s}), \quad \text{where} \quad E(\mathbf{s}) = \|\mathbf{L}\mathbf{s}\|_2^2 + \lambda \left( \|\mathbf{y} - \mathbf{H}\mathbf{s}\|_2^2 - \sigma_v^2 \right) \tag{3.99b}$$

and $\lambda$ is the Lagrange multiplier. Differentiating $E(\mathbf{s})$ with respect to $\mathbf{s}$ and setting
the result equal to zero, we get

$$\Phi(\hat{\mathbf{s}}) = \tfrac{1}{2} \nabla E(\hat{\mathbf{s}}) = \mathbf{L}^T \mathbf{L} \hat{\mathbf{s}} - \lambda \mathbf{H}^T (\mathbf{y} - \mathbf{H} \hat{\mathbf{s}}) = 0 \tag{3.100a}$$

Solving for $\hat{\mathbf{s}}$, we obtain the CLS filter, which can be expressed as

$$\hat{\mathbf{s}} = (\mathbf{H}^T \mathbf{H} + \alpha \mathbf{L}^T \mathbf{L})^{-1} \mathbf{H}^T \mathbf{y} \tag{3.100b}$$

where $\alpha = 1/\lambda$ is called the regularization parameter, which controls the tradeoff
between smoothness of the solution and fidelity to the observations. It can be readily
seen that the pseudo-inverse filter (3.98) is a special case of the CLS filter with
$\alpha = 0$. Direct implementation of (3.100) requires a huge matrix inversion. There are
two approaches to an efficient implementation: DFT domain implementation and
spatial-domain successive iterative approximations.


DFT Domain Implementation 

We premultiply both sides of (3.100b) by the 2D-DFT operator $\mathbf{W}^{-1}$ to arrive at the
frequency-domain filter expression after some matrix manipulations:

$$\hat{S}(k_1, k_2) = \frac{H^*(k_1, k_2)}{|H(k_1, k_2)|^2 + \alpha\, |L(k_1, k_2)|^2}\, Y(k_1, k_2) \tag{3.101}$$

where $L(k_1, k_2)$ denotes the eigenvalues of the regularization operator $\mathbf{L}$. Eqn. (3.101)
can be efficiently implemented in the DFT domain. The operator $\mathbf{L}$ can be defined
in terms of a 2D shift-invariant generating kernel (impulse response), and $L(k_1, k_2)$
is the 2D-DFT of this kernel.
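
The frequency-domain CLS filter can be sketched with NumPy as below. The sketch assumes circular convolution, a positive regularization parameter, and, as a default, a common 4-neighbor discrete Laplacian as the generating kernel of the regularization operator; the function name `cls_filter` and this particular kernel are assumptions rather than choices prescribed by the text.

```python
import numpy as np


def cls_filter(y, h, alpha, l_kernel=None):
    """Constrained least-squares restoration in the DFT domain (alpha > 0)."""
    if l_kernel is None:
        # Assumption: 4-neighbor discrete Laplacian as the high-pass kernel.
        l_kernel = np.array([[0.0, -1.0, 0.0],
                             [-1.0, 4.0, -1.0],
                             [0.0, -1.0, 0.0]])
    H = np.fft.fft2(h, s=y.shape)         # blur transfer function
    L = np.fft.fft2(l_kernel, s=y.shape)  # eigenvalues of the operator L
    Y = np.fft.fft2(y)
    S_hat = np.conj(H) * Y / (np.abs(H) ** 2 + alpha * np.abs(L) ** 2)
    return np.real(np.fft.ifft2(S_hat))
```

A larger `alpha` gives a smoother estimate with less noise amplification, while a very small `alpha` approaches the pseudo-inverse filter.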


Successive Iterative Approximations 


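Successive iterations is a gradient-descent method (covered in Appendix C) to optimize
$E(\mathbf{s})$ by taking a step in the negative direction of the gradient $\Phi(\hat{\mathbf{s}})$. The solution
is given by

$$\hat{\mathbf{s}}_{k+1} = \hat{\mathbf{s}}_k - \beta\, \Phi(\hat{\mathbf{s}}_k)$$

where $\beta$ is called the step size, with the initial condition $\hat{\mathbf{s}}_0 = 0$. Clearly a root of $\Phi(\hat{\mathbf{s}})$
is a fixed point of the iteration. Substituting (3.100a) into the successive approximation
iteration, and absorbing the constant $\lambda$ into the step size $\beta$, yields

$$\hat{\mathbf{s}}_{k+1} = \hat{\mathbf{s}}_k + \beta \mathbf{H}^T \mathbf{y} - \beta (\mathbf{H}^T \mathbf{H} + \alpha \mathbf{L}^T \mathbf{L}) \hat{\mathbf{s}}_k = \beta \mathbf{H}^T \mathbf{y} + [\mathbf{I} - \beta (\mathbf{H}^T \mathbf{H} + \alpha \mathbf{L}^T \mathbf{L})] \hat{\mathbf{s}}_k \tag{3.102}$$

which for $\alpha = 0$ reduces to the Landweber iteration that converges to the least-squares
solution. With any iterative algorithm, there are two important concerns:
i) does it converge, and if so, ii) what is the limiting solution? We answer these using
frequency-domain analysis. Let

$$\hat{S}_{k+1}(f_1, f_2) = \beta H^*(f_1, f_2)\, Y(f_1, f_2) + \left[ 1 - \beta \left( |H(f_1, f_2)|^2 + \alpha |L(f_1, f_2)|^2 \right) \right] \hat{S}_k(f_1, f_2)$$

It is easy to show that the frequency response of the filter at the $k$th iteration is

$$G_k(f_1, f_2) = \beta H^*(f_1, f_2) \sum_{i=0}^{k-1} \left[ 1 - \beta \left( |H(f_1, f_2)|^2 + \alpha |L(f_1, f_2)|^2 \right) \right]^i$$

Hence, if

$$\left| 1 - \beta \left( |H(f_1, f_2)|^2 + \alpha |L(f_1, f_2)|^2 \right) \right| < 1 \tag{3.103}$$

then

$$\lim_{k \to \infty} G_k(f_1, f_2) = \frac{H^*(f_1, f_2)}{|H(f_1, f_2)|^2 + \alpha |L(f_1, f_2)|^2}$$

Therefore, the iterations converge to the CLS solution provided that (3.103) is
satisfied.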
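
The convergence argument can be checked numerically at a single frequency. The sketch below iterates the frequency-domain recursion from a zero initial condition and compares the result with the CLS frequency response; the function names and the sample values of the blur response, regularization response, and step size are hypothetical.

```python
def iterate_frequency_response(H, L, alpha, beta, num_iters):
    """Frequency response after num_iters iterations at one frequency."""
    r = 1.0 - beta * (abs(H) ** 2 + alpha * abs(L) ** 2)
    if abs(r) >= 1.0:
        raise ValueError("step size violates the convergence condition")
    G = 0.0  # zero initial condition
    for _ in range(num_iters):
        G = beta * H.conjugate() + r * G
    return G


def cls_frequency_response(H, L, alpha):
    """Limiting (CLS) frequency response at one frequency."""
    return H.conjugate() / (abs(H) ** 2 + alpha * abs(L) ** 2)


# Hypothetical values at one frequency sample.
H, L, alpha, beta = 0.5 + 0.2j, 2.0, 0.1, 1.0
G = iterate_frequency_response(H, L, alpha, beta, num_iters=200)
print(abs(G - cls_frequency_response(H, L, alpha)) < 1e-12)  # True
```

The geometric factor is 0.31 for these values, so the iteration settles to the CLS response well within 200 steps; a step size that makes the factor's magnitude reach or exceed one is rejected.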


Choices for the Regularization Operator 


Common choices include the identity operator, Laplacian operator, and Wiener
operator. The identity operator applies a fixed regularization parameter independent of
image-frequency components, where we have $\mathbf{L} = \mathbf{I}$, and $\mathbf{I}$ is the identity matrix. The
2D generating kernel in this case is the Kronecker delta function $l[n_1, n_2] = \delta[n_1, n_2]$
and $L(k_1, k_2) = 1$. The Laplacian operator applies frequency-dependent regularization,
where the frequency response of the CLS filter rolls off at high frequencies.
The generating kernel is a discrete approximation $l[n_1, n_2]$ to the Laplacian, which
is discussed in Section 3.3.2. Then, $L(k_1, k_2)$ is the 2D-DFT of a particular discrete
approximation $l[n_1, n_2]$.

The Wiener deconvolution filter, which can be derived by following the steps
shown in Section 3.5.2 based on the observation model (3.92), is a special case of the
CLS filter with $\mathbf{L}^T \mathbf{L} = \mathbf{R}_s^{-1} \mathbf{R}_v$ or $|L(k_1, k_2)|^2 = P_v(k_1, k_2) / P_s(k_1, k_2)$, where $\mathbf{R}_s$ and $\mathbf{R}_v$
denote the covariance matrices of the ideal image and the noise, and $P_s(k_1, k_2)$ and
$P_v(k_1, k_2)$ denote their power spectra, respectively. Recall that the Wiener filter gives
the linear minimum mean-square error (LMMSE) estimate of the ideal frame given
the observation model. However, it requires a priori information about the image
and noise in the form of their power spectra.


Regularization by Sparse-Image Modeling 


Sparse-image models have recently been introduced as a detail-preserving alternative 
to the classical smoothness models for regularization of inverse problems including 
image restoration, super resolution, and in-painting [Ela 10]. The reader should refer 
to Appendix A for a discussion of sparse models.

We have studied the orthogonal wavelet transform (DWT) shrinkage method as
a simple application of sparse-image modeling to the image-denoising problem in
Section 3.5.4. This approach has been extended to image restoration within an iterative
expectation-maximization (EM) framework, which alternates between an E-step for
deconvolution based on the fast Fourier transform (FFT) and a DWT-based M-step,
which enforces sparsity of the solution, resulting in an efficient iterative process
requiring $O(N \log N)$ operations per iteration [Fig 03].

A mathematical formulation that includes several independently developed
image-restoration methods, including the EM framework, based on sparse image
modeling is given by [Ela 10]. In this formulation, the sparse-representation
coefficients of the restored image can be determined from

$$\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \left\{ \lambda \|\mathbf{a}\|_p^p + \tfrac{1}{2} \|\mathbf{H}\mathbf{D}\mathbf{a} - \mathbf{y}\|_2^2 \right\} \tag{3.104}$$

where $\mathbf{D}\mathbf{a}$ is a sparse-image representation, $\mathbf{H}$ is the degradation matrix, and $\mathbf{y}$
denotes the noisy and blurred observations. This problem can be solved by using the
framework of iterative shrinkage threshold (IST) algorithms, given by

$$\hat{\mathbf{a}}_{k+1} = \arg\min_{\mathbf{a}} \left\{ \lambda \|\mathbf{a}\|_1 + \tfrac{1}{2} \|\mathbf{a} - \boldsymbol{\beta}_k\|_2^2 \right\} \tag{3.105a}$$

which alternates between two simple steps: i) E-step: update $\boldsymbol{\beta}_k$ using

$$\boldsymbol{\beta}_k = \hat{\mathbf{a}}_k - \gamma \mathbf{D}^T \mathbf{H}^T (\mathbf{H}\mathbf{D}\hat{\mathbf{a}}_k - \mathbf{y}) \tag{3.105b}$$

which can be implemented by $O(N \log N)$ operations, and ii) M-step: a scalar
shrinkage of $\boldsymbol{\beta}_k$ given by Eqn. (3.82). The overall process is fast and effective. Several
IST-like fast algorithms have also been proposed [Dia 07, Bec 09].
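
The two-step IST iteration can be sketched with NumPy for small dense matrices. This is a sketch under stated assumptions: the sparsity term is the $\ell_1$ norm, the shrinkage step is soft thresholding with threshold equal to the regularization weight times the step size (the standard proximal-gradient choice, which the text does not spell out), and the step size is the reciprocal of the squared largest singular value of the combined operator. The names `soft` and `ist` are hypothetical.

```python
import numpy as np


def soft(x, t):
    """Element-wise soft thresholding with threshold t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def ist(y, H, D, lam, num_iters=200):
    """Iterative shrinkage for min_a lam*||a||_1 + 0.5*||H D a - y||^2."""
    A = H @ D
    # Assumption: step size 1 / (largest singular value of A)^2.
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(num_iters):
        beta = a - gamma * A.T @ (A @ a - y)  # gradient (update) step
        a = soft(beta, lam * gamma)           # scalar shrinkage step
    return a
```

With an identity dictionary and a mild blur matrix, the iteration recovers a sparse vector from its blurred observation up to a small bias introduced by the shrinkage. For large images, the matrix products would be replaced by FFT-based and fast-transform operators.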


Exploiting Non-Local Self-Similarities 


Sparse image modeling for image restoration has been extended to exploit non-local
image self-similarities [Dan 12, Don 13b]. In particular, [Dan 12] extends BM3D
image modeling to iterative decoupled deblurring (IDD-BM3D), which aims to achieve
a generalized Nash equilibrium balance between how well the restored image fits the
observations and the sparseness of the solution. Alternatively, [Don 13b] introduces
non-locally centralized sparse representation based restoration, which can be solved by
IST algorithms.


Boundary Problem in Image Restoration


Image sensors have a finite field of view. Since the values of the blurred pixels on the 
borders of the field of view (image) depend on some pixels outside the field of view, 
part of the information necessary to restore border pixels is not available. Although 
it seems that the effect of this problem would be limited to only border pixels, in 
general the impulse response of restoration filters is quite large and the effect propa- 
gates to almost all pixels in the image. We alleviate this problem as follows: In the 
space-domain implementations, we extrapolate the value of unavailable pixels from 
border pixels, usually by replication of the first/last row/column. In DFT domain 
implementations, because the images are assumed to be periodic, we interpolate 
between the left and right and top and bottom boundary pixels to estimate the 
unavailable pixel values. 


3.6.3 Blind Restoration — Blur Identification 


Blind restoration refers to the case where the PSF of the blur is unknown. Since both 
the PSF and the original image are unknown, blind restoration is an underdeter- 
mined (ill-posed) problem, which cannot be solved without strong assumptions on 
the PSF or the original image or both. 


Blur Identification from Zero-Crossings in the Fourier Domain 


A common model for natural images assumes that their Fourier spectrum does not
have zero-crossings. It is then possible to estimate the PSF of out-of-focus and linear
motion blurs from the power spectrum $P_y(\omega_1, \omega_2)$ of the blurred image, because they
introduce regular zero crossings, a signature of the PSF, in the spectrum of the degraded
image. For linear shift-invariant blur PSFs, we can write the power spectrum of the
blurred image as

$$P_y(\omega_1, \omega_2) = |H(\omega_1, \omega_2)|^2\, P_s(\omega_1, \omega_2) \tag{3.106}$$

Assuming $P_s(\omega_1, \omega_2)$ does not have zero-crossings, the zero-crossings of $P_y(\omega_1, \omega_2)$
are due to those of $H(\omega_1, \omega_2)$. We estimate the power spectrum of the observed
image in the DFT domain using the periodogram method given by

$$P_y[k_1, k_2] = \frac{1}{N_1 N_2} \left| Y[k_1, k_2] \right|^2 \tag{3.107}$$

where $Y[k_1, k_2]$ is the DFT of the noise-contaminated $N_1 \times N_2$ image. The cepstrum
is defined as the inverse discrete Fourier transform of the logarithm of the image
power spectrum, given by

$$c[n_1, n_2] = \mathrm{IDFT}\left( \log\{ P_y[k_1, k_2] \} \right) \tag{3.108}$$

Note that it is easier to identify zero crossings in the cepstrum domain, since the
logarithm enhances the visibility of the location of the zero crossings.


Example: Identification of Uniform Motion and Out-of-Focus Blurs 


Modeling linear motion blur by a uniform rectangular PSF, its frequency
response is given by a sinc function, which has periodic zeros. In the case of
an $8 \times 1$ horizontal motion blur and $128 \times 128$ DFT, there are seven zero
crossings at each row of the 2D-DFT occurring at the samples k = 16, 32, 48,
64, 80, 96, and 112. Similarly, modeling an out-of-focus blur by a uniform
circular PSF, its frequency response is given by a Bessel function, which has
regular zeros as a function of the radius of the circle. Then, we can estimate
the radius of the PSF from the zero crossings of the power spectrum of the
degraded image.
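
The zero locations quoted in this example can be verified with a short Python sketch that evaluates one row of the DFT of a length-8 uniform PSF directly. The helper name `dft_magnitude` and the numerical tolerance used to declare a zero are assumptions.

```python
import cmath


def dft_magnitude(h, num_points):
    """Magnitude of the num_points-point DFT of the 1D sequence h."""
    mags = []
    for k in range(num_points):
        total = sum(v * cmath.exp(-2j * cmath.pi * k * n / num_points)
                    for n, v in enumerate(h))
        mags.append(abs(total))
    return mags


blur_length, dft_size = 8, 128
h = [1.0 / blur_length] * blur_length  # uniform 8 x 1 motion-blur PSF
mags = dft_magnitude(h, dft_size)
zeros = [k for k, m in enumerate(mags) if m < 1e-9]
print(zeros)                   # [16, 32, 48, 64, 80, 96, 112]
print(dft_size // zeros[0])    # 8, the blur extent recovered from the first zero
```

The spacing between consecutive zeros equals the DFT size divided by the blur extent, which is how the extent can be read off from the spectrum of a blurred image.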


Blur Identification Using Parametric Models 


A maximum-likelihood formulation for parametric blur identification has been proposed,
where the image is modeled as a 2D auto-regressive model and the blur is
modeled by an FIR filter [Pav 92]. This method works especially well for blurs that do
not have zeros in the frequency domain, such as Gaussian blurs.


Blur Identification in the Image-gradient Domain 


A recent approach proposes blur identification based on image gradients rather than
image intensities [Fer 06]. Since both differentiation and convolution are linear
operators, the effect of blur in the gradient domain can be modeled as

$$\nabla y(n_1,n_2) = h(n_1,n_2) ** \nabla s(n_1,n_2)$$

where $\nabla$ denotes the spatial-gradient operator. The a priori distribution of the
gradients of sharp natural images is modeled by a zero-mean, heavy-tailed
mixture-of-Gaussians model, resulting in a maximum a posteriori probability (MAP)
estimation problem to find $h(n_1,n_2)$ [Fer 06].
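
The gradient-domain blur model rests on the fact that differencing and convolution commute. A minimal 1D sketch in plain Python illustrates this; the signal `s`, the PSF `h`, and the helper names are hypothetical, and the first-order difference stands in for the spatial gradient.

```python
def convolve(a, b):
    """Full linear convolution of two 1D sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def gradient(x):
    """First-order difference, which is itself a convolution with [1, -1]."""
    return convolve(x, [1.0, -1.0])


s = [10.0, 10.0, 10.0, 50.0, 50.0, 50.0, 20.0, 20.0]  # sharp 1D "image"
h = [0.25, 0.5, 0.25]                                  # blur PSF

grad_of_blurred = gradient(convolve(h, s))  # gradient of y = h * s
blurred_grad = convolve(h, gradient(s))     # h convolved with the gradient of s

max_diff = max(abs(a - b) for a, b in zip(grad_of_blurred, blurred_grad))
print(max_diff)
```

Because the two results agree, the blur can be estimated from gradient images, where the heavy-tailed prior on sharp-image gradients is easier to state than a prior on intensities.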
3.6.4 Restoration of Images Degraded by Space-Varying Blurs

In comparison with the amount of work on linear space-invariant (LSI) image res- 
toration, the literature reports only a few methods for the restoration of images 
degraded by linear shift-varying (LSV) blurs. Nevertheless, in real-world applica- 
tions degradations are often space-varying; i.e., the degrading system PSF changes 
with spatial location over the image. Examples of LSV blurs include motion blur 
when the relative motion is not parallel to the imaging plane or contains acceleration 
and out-of-focus blur when the scene has significant depth variation. Space-varying 
restoration methods include coordinate transformations [Saw 74], sectional process- 
ing [Tru 78, Tru 92], and iterative methods [Sch 81]. The coordinate transformation 
approach is applicable only to a special class of LSV degradations that can be trans- 
formed into an LSI degradation. In sectional methods the image is sectioned into 
rectangular regions, where each section is restored using a space-invariant method, 
such as the maximum a posteriori filter [Tru 78] or the modified Landweber iterative 
filter [Tru 92]. 

Here, we present the projections onto convex sets (POCS) formulation that is 
applicable to any shift-varying blur type and size. The POCS framework has also 
been used for LSI image restoration. In [Tru 84], Trussell and Civanlar use vari- 
ance of the residual constraint for restoration of space-invariant blurs. However, this 
constraint cannot easily be extended to space-variant restoration since it involves 
inversion of huge matrices that would not be Toeplitz for space-varying blurs. Later, 
Sezan and Trussell [Sez 91] develop a general framework of prototype-image-based 
constraints. Although the framework is general, specific constraints proposed in [Sez 
91] have been designed for space-invariant blurs. 


Overview of the POCS Method 


In the POCS method, the unknown signal, $s$, is assumed to be an element of an
appropriate Hilbert space $\mathcal{H}$. Each piece of a priori information or
constraint restricts the solution to a closed convex set. Thus, for $m$ pieces of
information, there are $m$ closed convex sets $C_i \subset \mathcal{H}$,
$i = 1,2,\ldots,m$, and $s \in C_0 = \bigcap_{i=1}^{m} C_i$, provided the intersection
$C_0$ is non-empty.

Given the sets $C_i$ and their respective projection operators $P_i$, the sequence
generated by

$$s_{k+1} = P_m P_{m-1} \cdots P_1\, s_k, \qquad k = 0,1,\ldots \tag{3.109a}$$

or, more generally, by

$$s_{k+1} = T_m T_{m-1} \cdots T_1\, s_k, \qquad k = 0,1,\ldots \tag{3.109b}$$

where $T_i = (1-\lambda_i)I + \lambda_i P_i$, $0 < \lambda_i < 2$, is the relaxed projection
operator, converges weakly to a feasible solution in the intersection $C_0$ of the
constraint sets. Convex sets and successive projections are illustrated in
Figure 3.27. Indeed, any solution in the intersection set is consistent with the
a priori constraints and therefore is a feasible solution. Note that $T_i$ reduces to
$P_i$ for unity relaxation parameter, i.e., $\lambda_i = 1$. The initialization $s_0$ can be
arbitrarily chosen. It should be emphasized that the POCS algorithm is in general
nonlinear, because the projection operations are in general nonlinear. For more
detailed discussions of the fundamental concepts of the POCS, the reader is
referred to [Tru 84].

Figure 3.27 Method of projection onto convex sets.
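
The iteration (3.109b) can be illustrated with a toy problem. The sketch below, in plain Python, assumes two hypothetical convex sets in the plane, a line and a box, chosen only because their projections have closed forms; it is not one of the constraint sets used for restoration.

```python
def project_line(x):
    """Projection onto C1 = {x : x[0] + x[1] = 1}."""
    d = (x[0] + x[1] - 1.0) / 2.0
    return [x[0] - d, x[1] - d]


def project_box(x, lo=0.0, hi=0.6):
    """Projection onto C2 = {x : lo <= x[i] <= hi for every i}."""
    return [min(max(xi, lo), hi) for xi in x]


def relaxed(project, x, lam):
    """Relaxed projection T = (1 - lam) I + lam P, with 0 < lam < 2."""
    p = project(x)
    return [(1.0 - lam) * xi + lam * pi for xi, pi in zip(x, p)]


def pocs(x0, lam=1.0, cycles=200):
    """Apply the relaxed projections cyclically, starting from x0."""
    x = list(x0)
    for _ in range(cycles):
        x = relaxed(project_line, x, lam)
        x = relaxed(project_box, x, lam)
    return x


x_pure = pocs([3.0, -2.0], lam=1.0)      # pure projections
x_relaxed = pocs([3.0, -2.0], lam=1.5)   # over-relaxed projections
print(x_pure, x_relaxed)
```

Both runs end in the intersection of the two sets, but at different points. This reflects the remark above that any member of the intersection is a feasible solution, so the result depends on the initialization and on the relaxation parameters.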


Restoration Using POCS 


The POCS method can readily be applied to the space-varying restoration problem 
using a number of space-domain constraints that are defined on the basis of the 
observed image and a priori information about the space-varying degradation pro- 
cess, the noise statistics, and the ideal image itself [Ozk 94]. 

Assuming that the space-varying blur PSF and the statistics of noise are known, 
we define the following closed, convex feasible solution (constraint) sets (one for 
each observed image pixel): 


$$C_{n_1,n_2} = \bigl\{ z[i_1,i_2] : \bigl| r^{(z)}(n_1,n_2) \bigr| \le \delta_0 \bigr\}, \qquad 0 \le n_1 \le N_1 - 1,\; 0 \le n_2 \le N_2 - 1 \tag{3.110a}$$

where

$$r^{(z)}(n_1,n_2) = y[n_1,n_2] - \sum_{i_1=0}^{N_1-1} \sum_{i_2=0}^{N_2-1} z[i_1,i_2]\, h(n_1,n_2;i_1,i_2) \tag{3.110b}$$

is the residual associated with solution $z$, which is an arbitrary member of the set.
The quantity $\delta_0$ is an a priori bound reflecting the statistical confidence with which
the actual image is a member of the set $C_{n_1,n_2}$. If $s$ denotes the ideal image, then
$r^{(s)}(n_1,n_2) = v[n_1,n_2]$ and the statistics of $r^{(s)}(n_1,n_2)$ should be identical to
those of $v[n_1,n_2]$. Hence, the bound $\delta_0$ is determined from the statistics of the noise
so that the ideal (actual) solution is a member of the set within a certain statistical
confidence. For example, if the noise has Gaussian distribution with the standard
deviation $\sigma_v$, $\delta_0$ is set equal to $c\,\sigma_v$, where $c \ge 0$ is determined by an
appropriate statistical confidence bound (e.g., $c = 3$ for 99% confidence). The
bounded residual constraint forces the estimate to be consistent with the observed
image. In other words, at each iteration, the estimate is constrained such that at
every pixel $(n_1,n_2)$, the absolute value of the residual between the observed image
value $y[n_1,n_2]$ and the pixel value obtained at $(n_1,n_2)$ by simulating the imaging
process using that estimate is required to be less than a predetermined bound.

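The projection $P_{n_1,n_2}\{x[i_1,i_2]\}$ of an arbitrary $x[i_1,i_2]$ onto $C_{n_1,n_2}$ can be
defined as [Tru 84]:

$$P_{n_1,n_2}\{x[i_1,i_2]\} =
\begin{cases}
x[i_1,i_2] + \dfrac{r^{(x)}(n_1,n_2) - \delta_0}{\sum_{o_1}\sum_{o_2} h^2(n_1,n_2;o_1,o_2)}\, h(n_1,n_2;i_1,i_2) & \text{if } r^{(x)}(n_1,n_2) > \delta_0 \\[2ex]
x[i_1,i_2] & \text{if } -\delta_0 \le r^{(x)}(n_1,n_2) \le \delta_0 \\[2ex]
x[i_1,i_2] + \dfrac{r^{(x)}(n_1,n_2) + \delta_0}{\sum_{o_1}\sum_{o_2} h^2(n_1,n_2;o_1,o_2)}\, h(n_1,n_2;i_1,i_2) & \text{if } r^{(x)}(n_1,n_2) < -\delta_0
\end{cases} \tag{3.111}$$
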

Additional constraints, such as bounded energy, amplitude, and limited support,
can be utilized to improve the results. For example, the amplitude constraint can be
defined as

$$C_A = \bigl\{ z[i_1,i_2] : \alpha \le z[i_1,i_2] \le \beta \ \text{for}\ 0 \le i_1 \le N_1 - 1,\; 0 \le i_2 \le N_2 - 1 \bigr\} \tag{3.112}$$

with amplitude bounds of $\alpha = 0$ and $\beta = 255$. The projection $P_A$ onto the amplitude
constraint $C_A$ is defined as

$$P_A\{x[i_1,i_2]\} =
\begin{cases}
0 & \text{if } x[i_1,i_2] < 0 \\
x[i_1,i_2] & \text{if } 0 \le x[i_1,i_2] \le 255 \\
255 & \text{if } x[i_1,i_2] > 255
\end{cases} \tag{3.113}$$


Given the above projections, an estimate $\hat{s}[i_1,i_2]$ of the ideal image $s[i_1,i_2]$ is
obtained iteratively as

$$\hat{s}_{k+1}[i_1,i_2] = T_A\, T_{N_1-1,N_2-1}\, T_{N_1-2,N_2-1} \cdots T_{0,0}\, \{\hat{s}_k[i_1,i_2]\} \tag{3.114}$$

where $k = 0,1,\ldots$, $0 \le i_1 \le N_1 - 1$, $0 \le i_2 \le N_2 - 1$, $T$ denotes the
generalized projection operator defined in (3.111), and the observed noisy and blurred
image $y[i_1,i_2]$ can be taken as the initial estimate $\hat{s}_0[i_1,i_2]$. Note that in every
iteration cycle $k$, each of the projection operators $T_{n_1,n_2}$ is used once, but the
order in which they are implemented is arbitrary.
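
The restoration procedure of this section can be sketched for a 1D signal. The following plain-Python example is only an illustration under stated assumptions: the space-varying PSF built by `make_blur_rows` is hypothetical (milder blur on the left half, stronger on the right), the observation is noise-free, the bound `delta` is picked by hand, and pure projections (unity relaxation) are used instead of relaxed operators.

```python
def make_blur_rows(n):
    """Hypothetical space-varying PSF: row i holds h(i; .) for output sample i."""
    rows = []
    for i in range(n):
        w = 0.1 if i < n // 2 else 0.2      # side-tap weight depends on location
        row = [0.0] * n
        row[i] = 1.0 - 2.0 * w
        if i > 0:
            row[i - 1] = w
        if i < n - 1:
            row[i + 1] = w
        rows.append(row)
    return rows


def residual(y_n, h_row, x):
    """Residual y[n] - sum_i x[i] h(n; i), a 1D version of Eqn. (3.110b)."""
    return y_n - sum(h * xi for h, xi in zip(h_row, x))


def project_residual(x, y_n, h_row, delta):
    """Projection onto {x : |residual| <= delta}, following Eqn. (3.111)."""
    r = residual(y_n, h_row, x)
    if abs(r) <= delta:
        return x
    energy = sum(h * h for h in h_row)
    step = (r - delta) / energy if r > delta else (r + delta) / energy
    return [xi + step * h for xi, h in zip(x, h_row)]


def project_amplitude(x, lo=0.0, hi=255.0):
    """Amplitude projection of Eqn. (3.113)."""
    return [min(max(xi, lo), hi) for xi in x]


def pocs_restore(y, rows, delta, cycles=1000):
    """One residual projection per sample, then the amplitude projection."""
    x = list(y)                      # observed signal as the initial estimate
    for _ in range(cycles):
        for n, h_row in enumerate(rows):
            x = project_residual(x, y[n], h_row, delta)
        x = project_amplitude(x)
    return x


s = [20.0, 20.0, 200.0, 200.0, 200.0, 60.0, 60.0, 60.0]       # ideal signal
rows = make_blur_rows(len(s))
y = [sum(h * si for h, si in zip(row, s)) for row in rows]    # blurred signal
delta = 0.5
s_hat = pocs_restore(y, rows, delta)
worst = max(abs(residual(y[n], rows[n], s_hat)) for n in range(len(s)))
print(s_hat, worst)
```

Each projection moves the estimate along the PSF row of one observed sample by just enough to bring that sample's residual back to the bound. The final estimate is only one member of the feasible set, so it agrees with the ideal signal to within what the bound `delta` and the conditioning of the blur allow.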


3.6.5 Image In-Painting 


Image in-painting refers to reconstruction/concealment of damaged or missing areas 
in an image in a way that cannot be distinguished from the original. It has many 
applications including removal of scratches, text overlays or an undesired object, 
concealment of image transmission losses such as packet losses, and disocclusion in 
image-based rendering of intermediate virtual camera viewpoints. The term origi- 
nates from manual restoration of artwork, such as paintings or old photographs, by 
skilled artists and is also known as “retouching.” 
The image in-painting problem can be modeled as


y=Ms+v 


where M is an N × N diagonal mask matrix with ones for existing pixels and zeros for
the missing ones along the diagonal. Digital in-painting differs from image interpo- 
lation, which often refers to re-sampling of a uniformly sampled data set. In-painting 
is clearly an ill-posed problem with many plausible solutions. The objective is that 
the reconstructed regions look natural and physically realistic to the human eye.
In-painting methods can be classified as diffusion-based, statistics and
exemplar-based, and sparse-representation-based methods [Gui 14].

Diffusion-based in-painting introduces smoothness priors by using parametric
models or partial differential equations (PDEs) to propagate (or diffuse) local
structures, such as straight lines or curves, from the borders of a missing patch
toward the interior. Many variants exist using different linear, nonlinear, isotropic,
or anisotropic models to favor propagation in particular directions or to take into
account the curvature of structures present in a local neighborhood. These methods
are well suited for in-painting small regions. They avoid generating unconnected
edges; however, they tend to blur textures and large areas.
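
The simplest member of the diffusion-based family can be sketched in one dimension. The plain-Python example below assumes linear isotropic diffusion, in which each missing sample is repeatedly replaced by the average of its two neighbors while known samples stay fixed. The function name and test signal are illustrative, and missing samples are assumed to lie in the interior of the signal.

```python
def inpaint_diffusion_1d(y, mask, iterations=500):
    """Fill samples where mask == 0 by iterating neighbor averaging."""
    x = [yi if m else 0.0 for yi, m in zip(y, mask)]
    for _ in range(iterations):
        new_x = list(x)
        for i in range(1, len(x) - 1):
            if not mask[i]:
                # Discrete heat-equation step: move toward the neighbor average.
                new_x[i] = 0.5 * (x[i - 1] + x[i + 1])
        x = new_x
    return x


# Observation y = M s for a ramp signal with three missing samples.
mask = [1, 1, 1, 0, 0, 0, 1, 1]
y = [0.0, 10.0, 20.0, 0.0, 0.0, 0.0, 60.0, 70.0]
filled = inpaint_diffusion_1d(y, mask)
print(filled)
```

The gap is filled with the smooth interpolant between its borders, which recovers a ramp exactly. The same behavior would turn a sharp edge or texture inside the gap into a gradual transition, which is the blurring limitation noted above.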

Statistics and exemplar-based methods exploit image self-similarity and statistical
models. The statistics of image patches or textures are assumed to be stationary
(for random textures) or homogeneous (for regular patterns). Local region-growing
methods grow texture one pixel or one patch at a time, while maintaining coherence
with nearby pixels. The principle of exemplar-based methods is to search for
a patch that is most similar to the known part of the missing patch, and then to copy
and paste the central pixel for pixel-based approaches or a set of pixels for
patch-based approaches.

With the recent growing interest in low-rank and sparse-image representations, 
sparse priors have also been used for solving the in-painting problem. The image is 
assumed to be composed of low-rank and sparse components in a given basis, e.g., 
DCT or wavelets. Known and missing parts of the image are assumed to share the 
same representation. The missing region is synthesized as a sparse linear combination 
of elements from an overcomplete dictionary [Gui 14]. 

Exemplar-based and sparse-representation-based methods are better suited to filling
large missing texture areas than diffusion-based methods. Hybrid methods, which
combine structural (geometrical) and textural components, have also emerged.
Exercises


Problem Set 3 


3.1 At how many standard deviations does a Gaussian fall to 5% of its peak value? On
the basis of this, suggest a suitable kernel size parameter $N$ for a Gaussian filter

$$h[n_1,n_2] = K \exp\left\{ -\frac{n_1^2 + n_2^2}{2\sigma^2} \right\}, \qquad -N \le n_1 \le N,\; -N \le n_2 \le N$$

given the value of $\sigma$. State how to determine the parameter $K$.
3.2 Find the frequency response of the zero-order hold filter (3.16) and the
linear interpolation filter (3.18) for L = 2. Compare them with that of the ideal
(band-limited) interpolation filter.


3.3 In this problem, we design 1D and 2D cubic-spline interpolation filters. We
first design a continuous-time cubic spline reconstruction filter that approximates
the ideal low-pass filter with the cutoff frequency $f_c = 1/2$. The impulse
response $h(t)$ of a filter that uses four neighboring samples at any time $t$ can be
expressed as

$$h(t) =
\begin{cases}
a_3|t|^3 + a_2|t|^2 + a_1|t| + a_0, & 0 \le |t| < 1 \\
b_3|t|^3 + b_2|t|^2 + b_1|t| + b_0, & 1 \le |t| < 2 \\
0, & 2 \le |t|
\end{cases}$$

The design criteria are: $h(t) = 0$ for all integer $t$ (i.e., original sample points),
except $t = 0$ where $h(0) = 1$, and the slope of the impulse response $dh(t)/dt$
must be continuous across original sample points.

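a. Show that the design criteria are met if the eight unknown coefficients $a_i$
and $b_i$, $i = 0, \ldots, 3$, satisfy the following seven equations:

$$\begin{aligned}
a_0 &= 1 \\
a_1 &= 0 \\
a_3 + a_2 &= -1 \\
b_3 + b_2 + b_1 + b_0 &= 0 \\
8b_3 + 4b_2 + 2b_1 + b_0 &= 0 \\
12b_3 + 4b_2 + b_1 &= 0 \\
3a_3 + 2a_2 &= 3b_3 + 2b_2 + b_1
\end{aligned}$$
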
b. Let $b_3$ be a free parameter, and express all other parameters in terms of $b_3$.
Determine a range of values for $b_3$ such that

$$\left.\frac{d^2 h(t)}{dt^2}\right|_{t=0} < 0 \quad\text{and}\quad \left.\frac{d^2 h(t)}{dt^2}\right|_{t=1} > 0$$

c. A digital cubic spline filter can be obtained by sampling $h(t)$. Show that for
interpolation by a factor of $L$, the impulse response of the cubic-spline filter
has $4L + 1$ taps. Determine the impulse response for $L = 2$ and $b_3 = -1$.

d. Design a 2D separable cubic spline interpolation filter with the same 
parameters. 
3.4 Consider the following image, where each pixel value is represented by 6 bits,
i.e., pixel values range between 0 and 63.





a. Find $\alpha$ and $\beta$ in order to stretch the dynamic range of this image to the
range [5,60] using the automatic gain control (AGC) method. Also, find
the output image.

b. Apply histogram equalization to the above image and compare the result 
with part (a). 


3.5 The horizontal Sobel filter is given by 





Find the impulse response of the combined smoothing and Sobel filter. 


3.6 In the Wiener filter given by Eqn. (3.68), how would you estimate the power
spectrum of the original image $P_s(f_1,f_2)$ and the noise $P_v(f_1,f_2)$?


3.7 Let $s$ denote the lexicographic ordering of pixel intensities in an image. Show
that convolution by an FIR filter kernel $h[n_1,n_2]$ can be expressed as $y = Hs$,
where $H$ is block-Toeplitz and block-circulant for linear and circular convolution,
respectively. Write the elements of $H$ in terms of $h[n_1,n_2]$.


3.8 Show that a block-circulant matrix can be diagonalized by a 2D-DFT. 
3.9 Suppose we have a 7 × 1 (pixels) uniform motion blur; i.e., $h[n_1,n_2]$ is a 1-D
boxcar function that is seven pixels long. How many elements of $H(u_1,u_2)$
in Eqn. (3.91) are exactly zero, assuming a 256 × 256 DFT? Repeat for an
8 × 1 (pixels) uniform motion blur. Show, in general, that exact zeros will be
encountered when the DFT size is an integer multiple of the blur size.


3.10 Show that the CLS filter (3.101) becomes a Wiener filter when L = RR,


3.11 The term $\alpha|L(u_1,u_2)|^2$ in Eqn. (3.101) limits noise amplification; however, it
causes a different type of artifact known as a “regularization artifact” due to
deviation of the regularized filter from the exact inverse. Provide a quantitative
analysis of the tradeoff between noise amplification and regularization
artifacts. (Hint: see [Tek 90].)


3.12 Iterative signal and image restoration based on variations of the Landweber
iteration has been heavily examined [Sch 81]. Discuss the relationship between
the Landweber iterations and the POCS method. (Hint: see [Tru 85].)


MATLAB Exercises 


3.1 Bi-lateral Filter: Implement the bi-lateral filter defined by Eqns. (3.6) and (3.7):

a. Comment on how the filter kernel and output vary for different values of $N$,
$\sigma_d^2$, and $\sigma_r^2$.

b. Compare the filter kernel and output with those of a Gaussian filter with the same
parameters.

c. Discuss how to implement fast bi-lateral filtering.


3.2 Gaussian and Laplacian Pyramids (Decimation and Interpolation):

a. To construct a Gaussian pyramid:

i. Define a 7 × 7 circularly symmetric Gaussian filter to construct a 3-level
Gaussian pyramid. First, apply the Gaussian filter to the image, then
subsample the output by a factor of 2 in both the horizontal and vertical
directions to form the second level of the pyramid. Repeat the process to
form the third level.

ii. Repeat (i) using a 7 × 7 box-car filter, i.e., (1/49) ones(7,7). Compare the
pyramids formed by using the Gaussian and box filters and discuss the results.

iii. Now construct a resolution pyramid without any low-pass filtering.
What is the problem with not using a filter? Explain in detail by referring
to the frequency domain.
b. A Laplacian pyramid is constructed by taking the difference between the
Gaussian filtered image and the original image to compute a difference
image at each level of a Gaussian pyramid as depicted below. Thus, a 4-level
Laplacian pyramid consists of one low resolution picture and three difference
images at the resolution of respective levels, denoted by $\{h_0, h_1, h_2, f_3\}$.

[Figure: Laplacian pyramid with difference images $h_0(i,j)$, $h_1(i,j)$, $h_2(i,j)$]

i. Perform interpolation using nearest-neighbor, bilinear, and bicubic 
spline interpolator filters. (Hint: Use the interp2 function in MATLAB.) 
Compare results. 

ii. In theory, we can recover the full-resolution image by adding the differ- 
ence images to the interpolated images at each level successively. But for 
the sake of this exercise, do not add the difference images. Just interpo- 
late the low-resolution image by a factor of 4 in each direction, and com- 
pare the resulting image with the original high-resolution image. What 
differences do you see? Explain by referring to the frequency domain. 


3.3 Edge Detection: Take two color images. Convert these images from the RGB
into the YCrCb domain:

a. Apply Sobel, Prewitt, and LoG filters on the Y channel only and find an
edge map by thresholding the resulting magnitude of the image gradient
appropriately. Which operator gives the best result? Explain and discuss the
results.

b. Now let's perform edge detection using the Canny edge detector (use the
Canny function in MATLAB with appropriate parameters). Compare the
result with the edge map found in (a). Explain and discuss the differences.

c. Add Gaussian noise to the luminance image (you can use the imnoise function
in MATLAB), and find the edges as described in (a) and (b) above.
Which method is the best in the presence of noise? Explain why.


3.4 Enhancement of Color Images: Take two color images. Convert these images
from RGB to YCrCb domain.

a. Perform the following image enhancement operations on the Y channel
only:

i. Automatic gain control

ii. Histogram equalization

Convert the processed images back to the RGB space for displaying the
results. Find the parameters that yield the best results for AGC and unsharp
masking. Specify these values with proper justification in your report.

b. Implement unsharp masking

i. Using simple Gaussian or box low-pass filtering

ii. Using bi-lateral filtering instead of low-pass filtering as in Durand and
Dorsey [Dur 02]

Compare the results.

c. Now apply unsharp masking to the Y, Cr, and Cb channels independently.
Convert the processed images back to the RGB space for displaying the
results. Compare the result with the one obtained above. Explain and comment
on the differences clearly.


3.5 Image Denoising (Noise Filtering)

a. Add white Gaussian noise with 10 dB SNR to an image to obtain

$$y(n_1,n_2) = s(n_1,n_2) + v(n_1,n_2)$$

b. Implement the IIR Wiener filter

$$H_W(k_1,k_2) = \frac{\hat{S}_s(k_1,k_2)}{\hat{S}_s(k_1,k_2) + \sigma_v^2}$$

using the 2D-DFT, where the power spectrum of the original image can be
estimated by

$$\hat{S}_s(k_1,k_2) = \frac{1}{N_1 N_2}\,\bigl|Y(k_1,k_2)\bigr|^2$$

and the variance of the noise can be estimated as the sample variance of a
flat region, such as sky or a piece of a flat wall, in the noisy image.

c. Implement the adaptive FIR Wiener filter

$$\hat{s}(n_1,n_2) = \hat{\mu}_s(n_1,n_2) + \frac{\hat{\sigma}_s^2(n_1,n_2)}{\hat{\sigma}_y^2(n_1,n_2)}\,\bigl[ y(n_1,n_2) - \hat{\mu}_s(n_1,n_2) \bigr]$$

where the local sample mean over a $(2M_1 + 1) \times (2M_2 + 1)$ window about
$(n_1,n_2)$ can be estimated as

$$\hat{\mu}_s(n_1,n_2) = \frac{1}{(2M_1+1)(2M_2+1)} \sum_{i_1=n_1-M_1}^{n_1+M_1} \sum_{i_2=n_2-M_2}^{n_2+M_2} y(i_1,i_2)$$

$\sigma_y^2(n_1,n_2)$ is estimated by the local sample variance as

$$\hat{\sigma}_y^2(n_1,n_2) = \frac{1}{(2M_1+1)(2M_2+1)} \sum_{i_1=n_1-M_1}^{n_1+M_1} \sum_{i_2=n_2-M_2}^{n_2+M_2} \bigl[ y(i_1,i_2) - \hat{\mu}_s(n_1,n_2) \bigr]^2$$

and the variance of the original image $\sigma_s^2(n_1,n_2)$ can be estimated from the
noisy image $y(n_1,n_2)$ as

$$\hat{\sigma}_s^2(n_1,n_2) = \max\bigl\{ 0,\; \hat{\sigma}_y^2(n_1,n_2) - \sigma_v^2 \bigr\}$$

d. Compare the results of (b) and (c) and write your conclusions.


3.6 Wavelet Shrinkage 
a. Add white Gaussian noise with 10 dB SNR to an image to obtain

$$y(n_1,n_2) = s(n_1,n_2) + v(n_1,n_2)$$

b. Compute two-level wavelet decomposition of the image $y(n_1,n_2)$ similar to
that shown in Figure 3.15 using the MATLAB function dwt for 1D-wavelet
transform.

i. Use Haar wavelet (wname= 'db1') 

ii. Use an orthogonal wavelet (wname= 'sym8') 

iii. Use Daubechies 5/3 integer wavelet (wname='rbio3.5') 

Display the resulting images. Can you see any differences between the trans- 
forms? Comment. (P.S. Write your own code using 1D-wavelet decomposi- 
tion of first rows and then columns.) 

c. Apply soft-thresholding in the orthogonal wavelet transform domain and 
reconstruct the filtered image. Display the filtered image and measure its SNR. 


3.7 Image Restoration

a. Apply 5 × 5 uniform out-of-focus blur to an image; then add white Gaussian
noise with 30 dB SNR to the blurred image to obtain the noisy and
blurred image $y(n_1,n_2)$.

b. Compute $|Y(k_1,k_2)|^2$ and display the result as a picture. Can you observe
the loci of zero-crossings? How does the picture change if you increase the
SNR to 40 dB?

c. Apply the inverse filter (3.97) in the DFT domain to restore the original
image. How do you handle image borders? What does the result look like?
Explain why.

d. Apply the CLS filter (3.101) in the DFT domain to restore the original
image. How do you handle image borders? How does the result change for
different choices of $|L(k_1,k_2)|^2$ and different values of $\alpha$? Try at least two
different $|L(k_1,k_2)|^2$ and three different values of $\alpha$ for each $|L(k_1,k_2)|^2$.

e. Comment on the tradeoff between noise amplification and regularization
artifacts.


MATLAB Resources 


Xin Li, Reproducible Research in Computational Science 
http://www.csee.wvu.edu/~xinl/source.html 


Jiawen Chen, Fast Bi-lateral Filter and Local Histogram Equalization
http://people.csail.mit.edu/jiawen/#code


Alessandro Foi, BM3D MATLAB Code
http://www.cs.tut.fi/~foi/GCF-BM3D/index.html#ref_software


Peter Kovesi, MATLAB Functions for Computer Vision and Image Processing 
http://www.csse.uwa.edu.au/~pk/research/matlabfns/#hysthresh 


CHAPTER 4 


Motion Estimation 





Motion estimation refers to estimating 2D image-plane motion (correspondence or 
optical flow) or 3D motion (object motion or pose). It is a fundamental problem in 
video processing (motion-compensated filtering/compression) and computer vision. 


Video is a time-varying two-dimensional (2D) spatial-intensity pattern that is formed 
by projecting a three-dimensional (3D) dynamic scene into a 2D image plane. Tem- 
poral variations in the 2D intensity pattern are usually due to relative 3D motion 
between a camera and objects in the scene. This chapter presents 3D models (in most 
cases, simplistic ones) for relative motion between a camera and a rigid (static) scene 
and 2D models for temporal variations of spatial-intensity patterns (pixel motion) 
in the image plane resulting from such rigid motion as well as independent non- 
rigid motion of objects in the scene. We start by modeling image formation, which 
includes different projection models in Section 4.1. Section 4.2 introduces motion 
models, including 3D rigid motion and 2D pixel motion models. We formulate the 
2D pixel motion-estimation problem in Section 4.3. 2D motion-estimation methods 
can be broadly classified as: i) differential methods that require estimation of spatial- 
and temporal intensity gradients, ii) search-based matching methods including block- 
matching and its variations, iii) nonlinear optimization methods including pel-recur- 
sive and Bayesian methods, and iv) transform-domain methods, which are discussed 
in Section 4.4 to Section 4.7, respectively. Besides being an essential component
of image registration/rectification and motion-compensated filtering/compression, 
2D-motion estimation is often the first step toward sparse or dense 3D-motion esti- 
mation, which is introduced in Section 4.8, from monocular/stereo video. 


4.1 Image Formation 


“Image formation” refers to mapping a 3D scene into an image-plane intensity pat- 
tern, which includes geometric projection and photometric effects of motion. Geo- 
metric image formation considers camera models for projecting a 3D scene into 
a 2D image plane, discussed in Section 4.1.1. Photometric image formation that 
models image intensity variations due to changes in the scene illumination in time as 
well as the photometric effects of the 3D motion is covered in Section 4.1.2. 


4.1.1 Camera Models 


Imaging systems capture projections of a time-varying 3D scene onto the 2D image 
plane. Commonly used camera models are projective camera, which employs perspec- 
tive projection, and affine camera, which covers orthographic, para-perspective, and 
weak-perspective projections. 


Projective Camera 


Perspective projection models exact image formation according to the principles of
geometrical optics using an ideal pinhole camera, where all rays from the object pass
through the center of projection, i.e., the center of the lens. For this reason, it is also
known as “central projection.” Perspective projection is illustrated in Figure 4.1 for
two configurations: a) the image plane $(x_1,x_2)$ coincides with the $(X_1,X_2,0)$ plane of
the scene (world) coordinate system, and the center of projection (lens) is between
the object and image planes; and b) the center of projection coincides with the origin
of world coordinates.

Figure 4.1 Perspective projection: (a) image plane coincides with the $(X_1, X_2, 0)$ plane of the world
coordinate system and (b) center of projection coincides with the origin of the world coordinates.

The relations that describe the perspective projection for the configuration in
Figure 4.1(a) are

$$\frac{x_1}{f} = -\frac{X_1}{X_3 - f} \quad\text{and}\quad \frac{x_2}{f} = -\frac{X_2}{X_3 - f}$$

or

$$x_1 = \frac{f X_1}{f - X_3} \quad\text{and}\quad x_2 = \frac{f X_2}{f - X_3} \tag{4.1a}$$

where $f$ denotes the focal length of the camera (distance from the center of projection
to the image plane), which can be obtained based on the similar triangles formed by
drawing perpendicular lines from the object point $(X_1,X_2,X_3)$ and the image point
$(x_1, x_2, 0)$ to the $X_3$ axis, respectively.

If we consider a different configuration where the center of projection coincides
with the origin of the world coordinates and the image plane is between the object
point and camera, which is depicted in Figure 4.1(b), a simple change of variables
yields the following equivalent expressions:

$$x_1 = \frac{f X_1}{X_3} \quad\text{and}\quad x_2 = \frac{f X_2}{X_3} \tag{4.1b}$$

The similar triangles used to obtain these expressions are shown in Figure 4.1(b).
Observe that the expressions (4.1b) can be employed as an approximate model for
the configuration in Figure 4.1(a) when $X_3 \gg f$, with the reversal of the sign, because
the image orientation in (4.1b) is the same as that of the object, as opposed to being
a mirror image as it should be in actual image formation.
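
The relationship between the two configurations can be checked numerically. The sketch below, in plain Python, evaluates (4.1a) and (4.1b) for a hypothetical scene point and focal length; the function names and numerical values are illustrative only.

```python
def project_a(X, f):
    """Eqn. (4.1a): image plane at X3 = 0, center of projection at X3 = f."""
    X1, X2, X3 = X
    return (f * X1 / (f - X3), f * X2 / (f - X3))


def project_b(X, f):
    """Eqn. (4.1b): center of projection at the origin, image plane at X3 = f."""
    X1, X2, X3 = X
    return (f * X1 / X3, f * X2 / X3)


f = 0.05                       # focal length, same units as the scene
X = (1.0, 2.0, 100.0)          # scene point far from the camera (X3 >> f)

xa = project_a(X, f)
xb = project_b(X, f)
print(xa, xb)                  # nearly equal in magnitude, opposite in sign

# Exact equivalence after the change of variables X3 -> X3 + f.
xa_shifted = project_a((X[0], X[1], X[2] + f), f)
print(xa_shifted)
```

The first comparison shows the sign reversal (mirror image) and the small discrepancy that vanishes as $X_3/f$ grows. The second shows that shifting the depth origin by $f$ makes the two expressions agree exactly up to sign.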

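The perspective projection is nonlinear in the Cartesian coordinates since
it requires division by the $X_3$ coordinate. However, it can be expressed as a linear
mapping in the homogeneous coordinates, where 3D scene points and 2D image
points are represented by 4- and 3-vectors, respectively, given by

$$\mathbf{X}_h = \begin{bmatrix} X_{h,1} \\ X_{h,2} \\ X_{h,3} \\ X_{h,4} \end{bmatrix} \quad\text{and}\quad \mathbf{x}_h = \begin{bmatrix} x_{h,1} \\ x_{h,2} \\ x_{h,3} \end{bmatrix} \tag{4.2}$$

The conversion back to Cartesian coordinates,

$$X_1 = \frac{X_{h,1}}{X_{h,4}},\quad X_2 = \frac{X_{h,2}}{X_{h,4}},\quad X_3 = \frac{X_{h,3}}{X_{h,4}} \quad\text{and}\quad x_1 = \frac{x_{h,1}}{x_{h,3}},\quad x_2 = \frac{x_{h,2}}{x_{h,3}} \tag{4.3}$$

is called dehomogenization. Then, expressions (4.1) can be written in a linear form

$$\lambda \mathbf{x}_h = \begin{bmatrix} \lambda x_1 \\ \lambda x_2 \\ \lambda \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ 1 \end{bmatrix}$$

where $\lambda$ is a scale parameter (constant). It is not possible to recover the scale
parameter without knowing some metric measurement about the scene. This linear
relationship can be rewritten as

$$\lambda \mathbf{x}_h = \begin{bmatrix} \lambda x_1 \\ \lambda x_2 \\ \lambda \end{bmatrix} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ 1 \end{bmatrix} \tag{4.4}$$

where the 3 × 3 matrix is called the camera calibration (intrinsic parameter) matrix
and the 3 × 4 matrix is called the extrinsic matrix, which shows the relative
positioning of camera and world coordinate systems. In this case, the matrix indicates
that the camera and world coordinates are aligned.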

The most general form of the perspective projection can be expressed by modifying
Eqn. (4.4) to include other intrinsic camera calibration parameters and arbitrary
camera pose, i.e., positioning (rotation and translation) of the camera with respect
to the (scene) world coordinates, as [Ver 89, Zha 04]

$$\lambda \mathbf{x}_h = \begin{bmatrix} \lambda x_1 \\ \lambda x_2 \\ \lambda \end{bmatrix} = \begin{bmatrix} f k_1 & s & x_{1,0} \\ 0 & f k_2 & x_{2,0} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ 1 \end{bmatrix} = K[R \mid t]\, \mathbf{X}_h = P\, \mathbf{X}_h \tag{4.5}$$

where $P = K[R \mid t]$ is a 3 × 4 projection matrix that models geometric image
formation in the most general case. The 3 × 3 matrix $K$ is the camera calibration
matrix with 5 degrees of freedom, where $(x_{1,0}, x_{2,0})$ denotes the coordinates of the
center of the image, $s = \cot\theta$ is a skewness parameter, where $\theta$ is the angle between
the $x_1$ and $x_2$ axes, $f$ is the focal length of the camera, and $k_1$, $k_2$ denote the pixel
aspect ratio. These parameters are illustrated in Figure 4.2. The 3 × 3 matrix $R$ is a
rotation matrix and $t$ is a 3 × 1 vector, which model the rotation and translation of the
camera coordinate system with respect to the scene (world) coordinate system,
respectively. Different representations exist for a rotation matrix, which has 3 degrees
of freedom.
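
The general projection of Eqn. (4.5) amounts to one matrix product followed by dehomogenization. The plain-Python sketch below assumes hypothetical calibration values (focal length times aspect ratio in pixel units, zero skew, image center at (320, 240)) and a simple pose; none of these numbers come from the text.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def projection_matrix(fk1, fk2, s, x10, x20, R, t):
    """P = K [R | t] as in Eqn. (4.5)."""
    K = [[fk1, s, x10],
         [0.0, fk2, x20],
         [0.0, 0.0, 1.0]]
    Rt = [list(R[i]) + [t[i]] for i in range(3)]   # 3 x 4 extrinsic matrix
    return matmul(K, Rt)


def project(P, X):
    """Map a 3D point to image coordinates, including dehomogenization."""
    Xh = list(X) + [1.0]
    xh = [sum(p * x for p, x in zip(row, Xh)) for row in P]
    return (xh[0] / xh[2], xh[1] / xh[2])          # divide by the scale lambda


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
P_aligned = projection_matrix(1000.0, 1000.0, 0.0, 320.0, 240.0,
                              identity, [0.0, 0.0, 0.0])
print(project(P_aligned, (0.1, -0.2, 2.0)))

# Camera rotated by 90 degrees about the optical axis and shifted in depth.
rot_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
P_posed = projection_matrix(1000.0, 1000.0, 0.0, 320.0, 240.0,
                            rot_z, [0.0, 0.0, 1.0])
print(project(P_posed, (0.1, -0.2, 2.0)))
```

With the identity rotation and zero translation, the result reduces to (4.1b) shifted by the image center. Scaling `P` by any nonzero constant leaves the projected point unchanged, which is the scale ambiguity noted for $\lambda$.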


Affine Camera 


Affine camera includes the orthographic, weak-perspective, and paraperspective pro- 
jection models. 


Orthographic Projection 


Orthographic projection is a simple approximation of the actual imaging process
where it is assumed that all rays from the 3D object (scene) to the image plane travel
parallel to each other. Hence, it is sometimes called the “parallel projection,” which
is illustrated in Figure 4.3, where the image plane is taken parallel to the $(X_1,X_2,0)$
plane of the world coordinate system.

Figure 4.2 Perspective projection model for arbitrary camera-pose and
camera-calibration parameters.

Figure 4.3 Orthographic projection model.


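In this configuration, the orthographic projection can be described in the
Cartesian coordinates by

$$x_1 = X_1 \quad\text{and}\quad x_2 = X_2$$

or in vector-matrix notation as

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} \tag{4.6}$$

where $(x_1,x_2)$ denote the image-plane coordinates.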

In the orthographic projection, the distance of the object from the camera does 
not affect image-plane intensity distribution. That is, the object always yields the 
same size image no matter how far away it is from the camera. Orthographic projec- 
tion provides a close approximation to the actual image formation when the distance 
of the object from the camera is much larger than the depth range of points on the 
object. In such cases, the orthographic projection may be preferred over more com- 
plicated but realistic models because it leads to algebraically and computationally 
more tractable algorithms. 
Weak-Perspective and Paraperspective Projections


Weak-perspective projection is a scaled-orthographic projection that offers a
compromise between the simplicity of the orthographic projection and realism of
perspective projection. It is given by

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \frac{f}{Z} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = M\, \mathbf{X} \tag{4.7}$$

where $f$ is the focal length of the camera, $Z$ denotes the depth ($X_3$ value along the
optical axis) of a reference point in the 3D object, and the 2 × 3 matrix $M$ is called
an affine camera matrix.

The paraperspective projection extends scaled-orthographic projection by model- 
ing the perspective deformation of an object with respect to a nearby reference plane 
that is parallel to the image plane. 
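
The trade-off among the projection models can be quantified with a small example. The plain-Python sketch below compares perspective projection (4.1b), weak-perspective projection (4.7), and orthographic projection (4.6) for two hypothetical points whose depths differ slightly from a reference depth; all numerical values are illustrative.

```python
def perspective(X, f):
    """Eqn. (4.1b): divide by the depth of each individual point."""
    return (f * X[0] / X[2], f * X[1] / X[2])


def weak_perspective(X, f, Z):
    """Eqn. (4.7): divide by the depth Z of a single reference point."""
    scale = f / Z              # M = (f / Z) [[1, 0, 0], [0, 1, 0]]
    return (scale * X[0], scale * X[1])


def orthographic(X):
    """Eqn. (4.6): drop the depth coordinate."""
    return (X[0], X[1])


f, Z = 1.0, 100.0                               # focal length, reference depth
points = [(1.0, 1.0, 99.0), (1.0, 1.0, 101.0)]  # object depth range is 2% of Z

rel_errors = []
for X in points:
    xp = perspective(X, f)
    xw = weak_perspective(X, f, Z)
    rel_errors.append(abs(xw[0] - xp[0]) / abs(xp[0]))
print(rel_errors)
```

The relative error of the weak-perspective model equals $|X_3/Z - 1|$, so it stays small as long as the depth range of the object is small compared with its distance from the camera. The orthographic model ignores depth altogether and therefore cannot represent the change in image size with distance.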


4.1.2 Photometric Effects of 3D Motion 


Image-pixel intensities are modeled proportional to the amount of light reflected 
by the objects in the scene. The scene reflectance function is generally assumed to 
contain a Lambertian and a specular component. Here, we assume that the specular 
component can be neglected. Such surfaces are called Lambertian surfaces, i.e., sur- 
faces whose appearance does not vary with viewpoint. More sophisticated reflectance 
models can be found in [Lee 90]. 


Lambertian Reflectance Model 


If a Lambertian surface is illuminated by a single distant-point source with uniform
intensity (in time), the resulting image intensity is given by Lambert’s cosine law
[Lee 90, Pen 91]:

$$s_c(x_1,x_2,t) = \rho\, \mathbf{N}(t) \cdot \mathbf{L} \tag{4.8a}$$

where $\rho$ denotes surface albedo, i.e., the fraction of the light reflected by the surface,
$\mathbf{L} = (L_1, L_2, L_3)$ is the unit vector in the mean illuminant direction, and $\mathbf{N}(t)$ is the
unit surface normal of the scene, at spatial location $(X_1,X_2,X_3)$ and time $t$, given by

$$\mathbf{N}(t) = \frac{(-p, -q, 1)}{(p^2 + q^2 + 1)^{1/2}} \tag{4.8b}$$

in which $p = \partial X_3/\partial x_1$ and $q = \partial X_3/\partial x_2$ are the partial derivatives of depth
$X_3(x_1,x_2)$ with respect to the image coordinates $x_1$ and $x_2$, respectively, under the
orthographic projection. Photometric image formation is illustrated in Figure 4.4.

Figure 4.4 Photometric image-formation model.

The illuminant direction can also be expressed in terms of tilt and slant angles 
as [Pen 91] 


$$\mathbf{L} = (L_1, L_2, L_3) = (\cos\tau \sin\sigma,\ \sin\tau \sin\sigma,\ \cos\sigma) \tag{4.9}$$


where $\tau$, the tilt angle of the illuminant, is the angle between $\mathbf{L}$ and the $(X_1, X_3)$ plane,
and $\sigma$, the slant angle, is the angle between $\mathbf{L}$ and the positive $X_3$ axis.

As a result of relative 3D motion between the scene surface and camera, the surface
normal varies with time; so do the photometric properties of the surface. Pentland
[Pen 91] shows that the photometric effects of motion can dominate the geometric 
effects in some cases. 
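
The Lambertian model can be sketched in a few lines of Python. The sketch takes the unit normal as $(-p, -q, 1)/\sqrt{p^2 + q^2 + 1}$, the standard form for a surface $X_3(x_1, x_2)$, and the illuminant direction from the tilt and slant angles of (4.9). The albedo and angle values are hypothetical, and clamping negative intensities to zero is an added assumption rather than part of (4.8a).

```python
import math


def surface_normal(p, q):
    """Unit surface normal from the depth gradients p and q, as in (4.8b)."""
    norm = math.sqrt(p * p + q * q + 1.0)
    return (-p / norm, -q / norm, 1.0 / norm)


def illuminant_direction(tilt, slant):
    """Unit illuminant vector from tilt and slant angles (radians), as in (4.9)."""
    return (math.cos(tilt) * math.sin(slant),
            math.sin(tilt) * math.sin(slant),
            math.cos(slant))


def lambertian_intensity(albedo, p, q, tilt, slant):
    """Image intensity rho * (N . L) from (4.8a).
    Clamping negative values to zero (surface facing away from the light)
    is an assumption of this sketch, not part of (4.8a)."""
    n = surface_normal(p, q)
    light = illuminant_direction(tilt, slant)
    return albedo * max(0.0, sum(a * b for a, b in zip(n, light)))


# Hypothetical values: a fronto-parallel patch (p = q = 0) lit from two slant angles.
head_on = lambertian_intensity(0.8, 0.0, 0.0, 0.0, 0.0)
oblique = lambertian_intensity(0.8, 0.0, 0.0, 0.0, math.radians(60.0))
# Rotating the surface (p != 0) changes N and therefore the intensity, which is
# the photometric effect of 3D motion mentioned in the text.
tilted = lambertian_intensity(0.8, 0.5, 0.0, 0.0, math.radians(60.0))
print(head_on, oblique, tilted)
```

The last call shows that a change in surface orientation alone, with fixed illumination, alters the observed intensity.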


4.2 Motion Models 


This section discusses modeling of relative motion between a camera and a scene.
The 2D image-plane (pixel) motion is the projection of 3D scene motion, which
may be due to 3D motion of the camera, change in camera calibration parameters,
and/or independent motion of one or more moving objects in the scene. However,
2D “projected motion” may not always be observable from a video (time-varying
image intensity) for reasons that are discussed below. Instead, what we observe is
the 2D “apparent motion” (optical flow or correspondence). We start by defining
the projected and apparent motion fields and discussing various ambiguities that are
inherent in them in Section 4.2.1. Models for the projected motion are presented in
Section 4.2.2. 2D apparent-motion models are covered in Section 4.2.3.


4.2.1 Projected Motion vs. Apparent Motion 


Projected motion refers to the 2D displacement (velocity) field, i.e., projection of 
the respective 3D vectors into the image plane. In order to estimate the projected 
motion (together with sparse/dense scene structure) we need sparse/dense corre- 
spondence or 2D displacement/optical flow (“apparent motion”), which is estimated 
from the observed time-varying image intensity (video). 


Projected Motion 


A complete treatment of projected motion models can be found in [Sze 06]. We 
investigate some cases of common interest, which are often treated under similar 
problem formulations and solution methods, in the computer-vision and video- 
processing communities. 


Case 1(a): Camera Motion in a Static Scene or Stereo Vision 


A static scene captured by a moving monocular camera arises in such problems as 
robot vision for autonomous navigation in static environments and 3D environment 
modeling by a handheld camcorder, which are often tackled by the computer-vision 
community. The imaging geometry for this case is depicted in Figure 4.5, where 
Camera 1 refers to camera position at time t and Camera 2 refers to camera position
at time t′. This configuration is identical to that of stereo vision, where Camera 1 and
Camera 2 refer to the left and right cameras, and motion estimation becomes pose 
estimation. In stereo vision, we can lift the requirement that the scene be static, since 
time is frozen and left and right cameras see the exact same scene even in the presence 
of independently moving objects. 


Projective Camera If we have M cameras (multiple frames of video or multi-views)
and N sparse feature points that are assumed to be visible in all M frames (views),
then the projected image-plane coordinates of the feature points in the homogeneous
coordinates are given by

$$\lambda_{ij}\, \mathbf{x}_{ij} = \mathbf{P}_i \mathbf{X}_j, \quad i = 0, \ldots, M-1, \quad j = 0, \ldots, N-1 \tag{4.10}$$

where $\mathbf{X}_j$ denotes the $j$th feature point in the scene (world) coordinates, $\mathbf{P}_i$ is the
3 × 4 projection matrix for the $i$th camera, and $\lambda_{ij}$ are scalars. Let the initial camera
matrix be $\mathbf{P}_0 = \mathbf{K}[\mathbf{I}_{3\times3} \mid \mathbf{0}_{3\times1}]$; then the relative pose (motion) between camera
$i$ and camera 0 (the 3D camera coordinates) can be modeled by a rotation matrix $\mathbf{R}_i$
and a translation vector $\mathbf{T}_i$, $i = 1, \ldots, M-1$.


Figure 4.5 Camera motion with a static scene or stereo vision.

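There is ambiguity in “projected motion” up to an invertible projective transformation
$\mathbf{H}$, since

$$\lambda_{ij}\, \mathbf{x}_{ij} = (\mathbf{P}_i \mathbf{H})(\mathbf{H}^{-1} \mathbf{X}_j) = \mathbf{P}_i \mathbf{X}_j, \quad i = 0, \ldots, M-1, \quad j = 0, \ldots, N-1 \tag{4.11}$$

Hence, any pair of camera matrices $\mathbf{P}_i \mathbf{H}$ and shape (structure) matrices $\mathbf{H}^{-1}\mathbf{X}_j$ yield
the same set of projected image-plane (pixel) coordinates and are projectively equivalent
[Har 04].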
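
The projective ambiguity can be checked numerically. In the Python sketch below, the camera matrix, the transformation H (chosen so that its inverse is known in closed form), and the scene point are all hypothetical values; the helper names are illustrative only.

```python
def matmul(A, B):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def project(P, X):
    """Project a homogeneous 3D point X (4-vector) with a 3x4 camera matrix P."""
    xh = [sum(P[i][k] * X[k] for k in range(4)) for i in range(3)]
    return (xh[0] / xh[2], xh[1] / xh[2])


# Hypothetical 3x4 camera matrix with a nonzero last column.
P = [[800.0, 0.0, 320.0, 40.0],
     [0.0, 800.0, 240.0, -20.0],
     [0.0, 0.0, 1.0, 0.5]]

# Hypothetical invertible projective transformation and its closed-form inverse.
H = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.1, -0.2, 0.05, 1.0]]
H_inv = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0],
         [-0.1, 0.2, -0.05, 1.0]]

X = [1.0, 2.0, 10.0, 1.0]                      # scene point in homogeneous coordinates

P_alt = matmul(P, H)                           # a different camera matrix
X_alt = [sum(H_inv[i][k] * X[k] for k in range(4)) for i in range(4)]  # a different structure

pixel = project(P, X)
pixel_alt = project(P_alt, X_alt)
print(pixel, pixel_alt)                        # the two pixel positions agree
```

Because only the pixel coordinates are observed, image measurements alone cannot distinguish the pair (P, X) from the pair (P H, H⁻¹ X).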


Case 1(b): Rigid Scene Motion with a Static Camera 


If we consider the motion of each object point X independently, then the 3D dis- 
placement vector field can be represented by a set of 3D translation vectors, one for 
each point. Let a 3D feature point X at time t move to position X′ at time t′ and
projections of X and X′ into the image plane be denoted by pixels x and x′, respectively.
The projection of the 3D motion (scene flow or 3D displacement) vector XX′
into a 2D motion (optical flow or 2D displacement) vector xx′ in the image plane is
illustrated in Figure 4.6. When we consider the motion of each object point X independently,
all 3D displacement vectors whose tips lie on the dotted line X′x′ give the
same 2D displacement vector because of the perspective projection.


Figure 4.6 Projection of the 3D scene motion vector onto the image plane.


If the entire field of view consists of a single 3D rigid object or a moving 3D rigid
object is segmented from the background, then relative 3D positions of all object
points at times t and t′ are related by a single rotation matrix $\mathbf{R}$ and a translation
vector $\mathbf{T}$, given by

$$\begin{bmatrix} X_1' \\ X_2' \\ X_3' \end{bmatrix} = \mathbf{R} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} + \mathbf{T} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} + \begin{bmatrix} T_1 \\ T_2 \\ T_3 \end{bmatrix} \tag{4.12a}$$

which can be expressed in the homogeneous coordinates as

$$\mathbf{X}' = \begin{bmatrix} r_{11} & r_{12} & r_{13} & T_1 \\ r_{21} & r_{22} & r_{23} & T_2 \\ r_{31} & r_{32} & r_{33} & T_3 \\ 0 & 0 & 0 & 1 \end{bmatrix} \mathbf{X} \tag{4.12b}$$

where X and X′ are the homogeneous coordinates of an object point at times t and
t′, respectively. We note that this case is equivalent to Case 1(a), where X′ and X
denote the same scene point with respect to two different camera coordinates that
are related by Eqn. (4.12).
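
A minimal Python sketch of the rigid-motion model follows. It applies the same rotation and translation in the 3 × 3 form and in the 4 × 4 homogeneous form, and checks that the distance between two points is preserved. The rotation angle, translation, and point coordinates are hypothetical.

```python
import math


def rigid_transform(X, R, T):
    """X' = R X + T, the 3x3 form of the rigid-motion model."""
    return [sum(R[i][k] * X[k] for k in range(3)) + T[i] for i in range(3)]


def homogeneous_matrix(R, T):
    """Stack R and T into the 4x4 matrix of the homogeneous form."""
    return [R[i] + [T[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def apply_homogeneous(A, X):
    """Apply a 4x4 matrix to a 3D point and return inhomogeneous coordinates."""
    Xh = list(X) + [1.0]
    Yh = [sum(A[i][k] * Xh[k] for k in range(4)) for i in range(4)]
    return [Yh[i] / Yh[3] for i in range(3)]


angle = math.radians(10.0)                     # hypothetical rotation about the X3 axis
c, s = math.cos(angle), math.sin(angle)
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
T = [0.5, -0.2, 1.0]                           # hypothetical translation

X_a, X_b = [1.0, 2.0, 10.0], [-1.0, 0.5, 12.0]
moved_a = rigid_transform(X_a, R, T)
moved_b = rigid_transform(X_b, R, T)
moved_a_h = apply_homogeneous(homogeneous_matrix(R, T), X_a)

dist_before = math.dist(X_a, X_b)
dist_after = math.dist(moved_a, moved_b)       # rigid motion keeps distances fixed
print(moved_a, moved_a_h, dist_before, dist_after)
```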
Case 2: Non-Rigid Scene Motion or Multiple Motions with
Possible Camera Motion


This is a typical case in general video processing, such as motion-compensated fil- 
tering and compression of generic broadcast TV or surveillance videos. If there 
are multiple rigid motions or non-rigid 3D motion in the scene, Eqn. (4.10), 
which assumes a rigid 3D transformation (4.12) between the scene and camera 
coordinates, is violated. We do not attempt to model non-rigid 3D motion in 
this book; interested readers should refer to [Ter 88, Tan 94]. This case may be 
addressed by segmenting the scene into multiple objects, each exhibiting a single 
rigid 3D motion or directly modeling resulting 2D “apparent” motion (discussed in 
Section 4.2.3). 


Apparent Motion — Optical Flow 


Apparent motion refers to correspondence (displacement) or optical flow (velocity) 
field that is perceived (can be observed) from the time-varying image intensity pat- 
tern (video). We observe that “apparent motion” (correspondence or optical flow) is, 
in general, different from the “projected motion” (displacement or velocity) field due
to the following ambiguities [Ver 89]:


• Lack of sufficient spatial-image gradient: There must be sufficient gray-level
(color) variation within moving regions for the actual motion to be observable.
An example of an unobservable motion is shown in Figure 4.7, where a circle
with uniform intensity rotates about its center. This motion generates no optical
flow, and thus is unobservable.

We often perform motion estimation over small blocks of pixels, called
a finite aperture. When this finite aperture does not contain sufficient image
gradient, the motion is not observable within that aperture, which is referred to as
the aperture problem (see Section 4.3.4).


Figure 4.7 All projected motion does not generate optical flow.


Figure 4.8 All optical flow does not correspond to projected motion.


• Changes in external illumination and/or shading: An observable optical flow
may not always correspond to an actual motion. For example, if the intensity 
and/or direction of external illumination vary from frame to frame, as shown 
in Figure 4.8, then an optical flow will be observed even though there is no 
motion. Therefore, changes in the external illumination impair estimation of 
the actual 2D motion field if it is not properly modeled. 

In some cases, the shading may vary from frame to frame due to 3D motion 
of objects even if there is no change in the external illumination. For example, 
if an object rotates, its surface normal changes, which results in a change in 
the shading. This change in shading, discussed in Section 4.1.2, may cause 
the intensity of the pixels along a motion trajectory to vary, which needs to be 
taken into account. 


Because 3D and 2D motion estimation problems are both ill-posed due to 
various ambiguities that are discussed, motion estimation methods need additional 
assumptions (models) about the structure of the 2D motion field for regulariza- 
tion of the problem, which are discussed next. General discussion of deterministic 
smoothness models and Markov random field models can be found in Appendices 
B and C, respectively. 


4.2.2 Projected 3D Rigid-Motion Models 


According to classical kinematics, 3D motion can be classified as rigid vs. non-rigid 
motion. In rigid motion, relative distances between a set of 3D points remain fixed as 
the scene evolves in time. This section presents exact models to describe the projec- 
tion of relative rigid 3D motion of a set of object points and a camera, which can be 
derived for Cases 1(a) and 1(b) (discussed earlier).


General Model (Depth Map)


If we make no assumptions about the scene structure and take the depth $X_3$ of each
feature point as an independent variable, then projecting the 3D feature point
position X′ given by (4.12) into the image plane using (4.1), we have

$$x_1' = f\,\frac{X_1'}{X_3'} = f\,\frac{r_{11}X_1 + r_{12}X_2 + r_{13}X_3 + T_1}{r_{31}X_1 + r_{32}X_2 + r_{33}X_3 + T_3}$$

$$x_2' = f\,\frac{X_2'}{X_3'} = f\,\frac{r_{21}X_1 + r_{22}X_2 + r_{23}X_3 + T_2}{r_{31}X_1 + r_{32}X_2 + r_{33}X_3 + T_3}$$

Dividing the numerator and denominator of both expressions by $X_3$, and substituting
$x_1 = f X_1/X_3$ and $x_2 = f X_2/X_3$, we have

$$x_1' = f\,\frac{r_{11}x_1 + r_{12}x_2 + f r_{13} + f T_1/X_3}{r_{31}x_1 + r_{32}x_2 + f r_{33} + f T_3/X_3} \tag{4.13a}$$

$$x_2' = f\,\frac{r_{21}x_1 + r_{22}x_2 + f r_{23} + f T_2/X_3}{r_{31}x_1 + r_{32}x_2 + f r_{33} + f T_3/X_3} \tag{4.13b}$$

where the depth $X_3$ of each feature point appears as a parameter in these expressions.
Here, the six 3D motion parameters (three rotation and three translation) constrain
the direction of 2D image motion (displacement or flow) vectors, while the depth
parameter is required to determine the exact value of the 2D motion vector.
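
The role of depth in the general model can be illustrated with a short Python sketch. It assumes the perspective projection $x_i = f X_i / X_3$ and evaluates the projected position in two ways: by moving the 3D point and projecting it, and by the depth-parameterized expression obtained after dividing by $X_3$. The focal length, motion parameters, and point are hypothetical.

```python
import math


def rotation_about_x2(angle):
    """Rotation matrix for a rotation by `angle` (radians) about the X2 axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def project(X, f):
    """Perspective projection x_i = f * X_i / X_3 (assumed form of the camera model)."""
    return (f * X[0] / X[2], f * X[1] / X[2])


def projected_motion(x, depth, R, T, f):
    """New image position of pixel x, given its depth and the rigid motion (R, T)."""
    x1, x2 = x
    num1 = R[0][0] * x1 + R[0][1] * x2 + f * R[0][2] + f * T[0] / depth
    num2 = R[1][0] * x1 + R[1][1] * x2 + f * R[1][2] + f * T[1] / depth
    den = R[2][0] * x1 + R[2][1] * x2 + f * R[2][2] + f * T[2] / depth
    return (f * num1 / den, f * num2 / den)


f = 500.0                                   # hypothetical focal length in pixels
R = rotation_about_x2(math.radians(2.0))    # hypothetical small rotation
T = [0.1, 0.0, 0.2]                         # hypothetical translation
X = [1.0, -0.5, 8.0]                        # hypothetical 3D point, depth X3 = 8

X_moved = [sum(R[i][k] * X[k] for k in range(3)) + T[i] for i in range(3)]
direct = project(X_moved, f)
via_model = projected_motion(project(X, f), X[2], R, T, f)

# The same pixel with a different assumed depth lands somewhere else.
near = projected_motion(project(X, f), 4.0, R, T, f)
far = projected_motion(project(X, f), 40.0, R, T, f)
print(direct, via_model, near, far)
```

The `near` and `far` results differ only because of the translation terms divided by depth; with zero translation the depth drops out of the expressions.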


Homography (Perspective Model) 


If the 3D structure (shape) of a moving object can be modeled by a non-deformable 
surface, e.g., a planar or piecewise planar surface, to relate the depth of object points, 
then the number of free depth parameters can be reduced. Let the set of 3D feature 
points lie on a plane, described by 


aX, +bX, +cX, =1 


where $[a\ b\ c]^T$ denotes the normal vector of the plane. Since the right-hand side
is unity, the 3D displacement model (4.12a) can be rewritten as

$$\begin{bmatrix} X_1' \\ X_2' \\ X_3' \end{bmatrix} = \mathbf{R} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} + \mathbf{T} = \mathbf{R} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} + \mathbf{T}\, [a\ b\ c] \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = \mathbf{H} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}$$

where $\mathbf{H} = \mathbf{R} + \mathbf{T}\,[a\ b\ c]$. Now, projecting 3D scene coordinates into the 2D
image plane, following steps similar to those used to obtain (4.13), we have the
homography relations, which model perspective projection of 3D motion of a planar
surface, given by

$$x_1' = \frac{h_1 x_1 + h_2 x_2 + h_3}{h_7 x_1 + h_8 x_2 + h_9} \tag{4.14a}$$

$$x_2' = \frac{h_4 x_1 + h_5 x_2 + h_6}{h_7 x_1 + h_8 x_2 + h_9} \tag{4.14b}$$

where $h_9$ is sometimes set equal to 1 in order to account for the scale ambiguity. Eqn.
(4.14) is also known as the perspective model.

We note that if T = 0, we have H =R and the homography (4.14) provides an 
exact mapping between two image frames (i.e., compensating for camera calibration 
and rotation) regardless of the scene geometry. In summary, two frames are related
by a homography if and only if one of the following holds:


1. they are views of the same 3D planar surface from different camera positions 
(with rotation and translation). 

2. they are captured by the same camera where the camera is only allowed to rotate 
about its optical center and/or zoom (without any translation). This case is 
independent of the scene structure. 


If there are multiple planar objects with different motions or when the surface of
a single moving object is approximated by a piecewise planar model, then a different
parameter set, $h_1, \ldots, h_9$, is required to describe the motion of pixels for each planar
piece, which is often called a layered scene representation (see Chapter 5).
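
The plane-induced homography can be verified with a small Python sketch. It builds H = R + T [a b c] for a hypothetical plane and motion, then checks that mapping an image point through the homography gives the same result as moving the 3D point and projecting it. Normalized image coordinates (f = 1) are assumed to keep the example short; all numeric values are illustrative.

```python
import math


def plane_homography(R, T, n):
    """H = R + T [a b c] for points on the plane a*X1 + b*X2 + c*X3 = 1."""
    return [[R[i][j] + T[i] * n[j] for j in range(3)] for i in range(3)]


def apply_homography(H, x):
    """Map an image point through the homography; H rows hold (h1..h3), (h4..h6), (h7..h9)."""
    x1, x2 = x
    den = H[2][0] * x1 + H[2][1] * x2 + H[2][2]
    return ((H[0][0] * x1 + H[0][1] * x2 + H[0][2]) / den,
            (H[1][0] * x1 + H[1][1] * x2 + H[1][2]) / den)


# Hypothetical plane, motion, and point.
n = (0.01, 0.02, 0.1)                              # plane normal [a b c]
angle = math.radians(3.0)
c, s = math.cos(angle), math.sin(angle)
R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]   # rotation about the X2 axis
T = [0.2, -0.1, 0.3]

X1, X2 = 1.0, -2.0
X3 = (1.0 - n[0] * X1 - n[1] * X2) / n[2]          # choose X3 so the point lies on the plane
X = [X1, X2, X3]

# Direct route: move the 3D point, then project with f = 1.
Xm = [sum(R[i][k] * X[k] for k in range(3)) + T[i] for i in range(3)]
direct = (Xm[0] / Xm[2], Xm[1] / Xm[2])

# Homography route: map the image point of X without using its depth.
H = plane_homography(R, T, n)
mapped = apply_homography(H, (X1 / X3, X2 / X3))

# Scale ambiguity: dividing all entries by h9 leaves the mapping unchanged.
H_unit = [[h / H[2][2] for h in row] for row in H]
mapped_unit = apply_homography(H_unit, (X1 / X3, X2 / X3))
print(direct, mapped, mapped_unit)
```

After normalization by $h_9$, eight free parameters remain, which is why the homography is described as having 8 degrees of freedom.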


Residual Planar-Parallax Motion Model 


While the general model (4.13) measures feature point depth $X_3$ with respect to the
camera coordinates, the depth can also be measured with respect to a reference plane
in the scene. A reference plane that is visible in both frames can be aligned using the
homography given by (4.14), which compensates motion due to change in camera
calibration and camera rotation. Any residual motion is due only to camera translation
and deviation of the scene structure from the reference plane. This residual
image-plane motion, called planar parallax motion, is a radial vector field centered
at the epipole (focus of expansion) $\mathbf{e} = [e_1\ e_2\ e_3]^T$ in the homogeneous coordinates.
The image-plane displacement in homogeneous coordinates can be decomposed as

$$\mathbf{x}' - \mathbf{x} = (\mathbf{x}' - \mathbf{x}_w) + (\mathbf{x}_w - \mathbf{x})$$

where $\mathbf{x}_w$ is obtained by warping x to the coordinate system of the second camera
using the homography (4.14). Here, $\mathbf{x}' - \mathbf{x}_w$ represents the planar-motion component.
The residual (planar parallax) motion $\mathbf{x}_w - \mathbf{x}$ can be modeled by





fa 
—x=-— 一 4.15 
二 一 Eros (exse) ( ) 


where $\gamma = H/X_3$ and $H$ denotes the vertical distance of the 3D point X from the
reference plane in the scene, which is a shape-invariant feature. The derivation of
(4.15) can be found in [Ira 02, Ira 98].


4.2.3 2D Apparent-Motion Models 


This section introduces parametric and non-parametric 2D apparent-motion models 
that are either approximations to the projected motion model or aim to impose a 
local smoothness constraint to regularize apparent motion estimation. 


Parametric Models 


Parametric models aim to describe 2D apparent motion (displacement or flow) of a
video frame or a block of pixels with a small number of parameters. Assuming there
is no occlusion, the $(k+1)$st frame of a sequence can be expressed as $s_{k+1}(\mathbf{x}) = s_k(\mathbf{x}')$,
where $\mathbf{x}' = h(\mathbf{x}; \boldsymbol{\theta})$ is a transformation of pixels from frame $k$ to $k+1$, given the
parameter vector $\boldsymbol{\theta}$. The transformation $h(\mathbf{x}; \boldsymbol{\theta})$ must be unique and invertible.
Homography, discussed in Section 4.2.2, is an example of a parametric model with
8 degrees of freedom (free parameters); however, it is nonlinear in the parameters
since it involves division. We introduce simpler models and linear approximations to
homography in this section.


Block-Translation Model


The simplest motion model is block translation, which assumes a frame is composed
of moving blocks, whose motion can be characterized by a translation vector
$\mathbf{d} = (d_1, d_2)^T$, i.e.,

$$x_1' = x_1 + d_1 \tag{4.16a}$$

$$x_2' = x_2 + d_2 \tag{4.16b}$$

where $(x_1', x_2')$ denotes coordinates of pixel $(x_1, x_2)$ in the reference frame $k \pm l$. It
is used in video-compression standards, since it is simple and effective for model- 
ing small motions that can be approximated by translation. Two motion-estimation 
methods specifically designed for estimating block translation are block-matching 
and phase-correlation, discussed in Sections 4.5 and 4.7, respectively. 


Affine Model 


The orthographic projection of 3D rigid motion of a planar surface can be described 
by a six-parameter affine model. It can be applied to an image frame or just a block 
of pixels, given by 


$$x_1' = a_1 x_1 + a_2 x_2 + a_3 \tag{4.17a}$$

$$x_2' = a_4 x_1 + a_5 x_2 + a_6 \tag{4.17b}$$


It provides a good approximation to homography if the distance of the planar surface 
from the camera is large enough so that all rays from the planar object to the camera 
can be assumed parallel. Special cases of the affine model include the following 2D 
image (pixel) motions: 


1. Pure translation: If $a_2 = a_4 = 0$ and $a_1 = a_5 = 1$, then Eqn. (4.17) reduces to (4.16),
with $a_3 = d_1$ and $a_6 = d_2$.
2. Pure rotation: If $a_3 = a_6 = 0$, $a_1 = \cos\theta$, $a_2 = \sin\theta$, $a_4 = -\sin\theta$, and $a_5 = \cos\theta$,
then

$$x_1' = x_1 \cos\theta + x_2 \sin\theta$$

$$x_2' = -x_1 \sin\theta + x_2 \cos\theta$$

models rotation of the axis in the image plane by angle $\theta$.
3. Isometry refers to combination of rotation and translation with 3 degrees of
freedom, $\theta$, $a_3$, $a_6$.
4. Isotropic scaling (Zoom): If $a_1 = a_5 = k$ and $a_2 = a_3 = a_4 = a_6 = 0$, then

$$x_1' = k x_1$$

$$x_2' = k x_2$$

5. Similarity transformation refers to a combination of rotation, translation, and
scaling:

$$x_1' = k\,(x_1 \cos\theta + x_2 \sin\theta) + d_1$$

$$x_2' = k\,(-x_1 \sin\theta + x_2 \cos\theta) + d_2$$

In its most general form, the affine transformation (4.17) preserves parallel lines
in the image plane, i.e., two parallel lines in frame $k$ are mapped to parallel lines in
frame $k + 1$.
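
The special cases above are easy to exercise in code. The following Python sketch implements the six-parameter affine model, builds the parameters of a similarity (taken here as a uniform scale k applied to the whole rotation, which is an assumption about the intended form), and checks two properties: parallel segments stay parallel under a general affine map, and a similarity scales distances by k. All parameter values are hypothetical.

```python
import math


def affine(x, a):
    """Six-parameter affine model (4.17); a = (a1, ..., a6)."""
    x1, x2 = x
    a1, a2, a3, a4, a5, a6 = a
    return (a1 * x1 + a2 * x2 + a3, a4 * x1 + a5 * x2 + a6)


def similarity_params(k, theta, d1, d2):
    """Affine parameters for rotation by theta, uniform scale k, and translation (d1, d2)."""
    c, s = math.cos(theta), math.sin(theta)
    return (k * c, k * s, d1, -k * s, k * c, d2)


def cross(u, v):
    """2D cross product; zero when u and v are parallel."""
    return u[0] * v[1] - u[1] * v[0]


def segment_direction(p, q, a):
    """Direction of the segment p-q after mapping both endpoints with the affine model."""
    pm, qm = affine(p, a), affine(q, a)
    return (qm[0] - pm[0], qm[1] - pm[1])


# A general (hypothetical) affine map applied to two parallel segments.
a_general = (1.2, 0.3, 5.0, -0.4, 0.9, -2.0)
dir_1 = segment_direction((0.0, 0.0), (2.0, 1.0), a_general)
dir_2 = segment_direction((3.0, -1.0), (7.0, 1.0), a_general)  # parallel to the first segment
parallel_after = abs(cross(dir_1, dir_2)) < 1e-9

# A similarity scales every distance by k.
a_sim = similarity_params(2.0, math.radians(30.0), 4.0, -1.0)
p, q = (1.0, 1.0), (4.0, 5.0)            # distance 5 before mapping
pm, qm = affine(p, a_sim), affine(q, a_sim)
scaled_distance = math.dist(pm, qm)      # close to 10 for k = 2
print(parallel_after, scaled_distance)
```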


Linear Approximations to the Perspective-Motion Model 


The bi-quadratic, pseudo-perspective, and bilinear models are linear approxima- 
tions to homography (perspective model). The bi-quadratic model, with 12 free 
parameters, 

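$$x_1' = a_1 x_1^2 + a_2 x_2^2 + a_3 x_1 x_2 + a_4 x_1 + a_5 x_2 + a_6 \tag{4.18a}$$

$$x_2' = a_7 x_1^2 + a_8 x_2^2 + a_9 x_1 x_2 + a_{10} x_1 + a_{11} x_2 + a_{12} \tag{4.18b}$$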


can be obtained by a Taylor series expansion of the homography. The pseudo- 
perspective model, with eight free parameters, given by 


$$x_1' = a_1 x_1 + a_2 x_2 + a_3 + a_7 x_1^2 + a_8 x_1 x_2 \tag{4.19a}$$

$$x_2' = a_4 x_1 + a_5 x_2 + a_6 + a_7 x_1 x_2 + a_8 x_2^2 \tag{4.19b}$$


is an instantaneous flow approximation. The bilinear (pseudo-perspective) model, 
given by 


$$x_1' = a_1 x_1 + a_2 x_2 + a_3 x_1 x_2 + a_4 \tag{4.20a}$$

$$x_2' = a_5 x_1 + a_6 x_2 + a_7 x_1 x_2 + a_8 \tag{4.20b}$$


is another linearized approximation. Parameters of these models can be estimated 
directly from image intensity (Section 4.4.1) or from given/pre-computed feature 
correspondences (Section 4.5.5). 


Non-Parametric Models 


Unlike parametric models, non-parametric models can be used to estimate 2D
motion that is the projection of non-rigid and deformable 3D motion. Non-parametric
models impose smoothness constraints on the estimated 2D-motion field without
any assumptions about the nature of the underlying 3D motion and scene structure.
Use of smoothness constraints to regularize solutions of ill-posed problems is well-
known in science and engineering [Ber 88] (see Appendix A). The non-parametric
models can be classified as deterministic vs. stochastic models.


Deterministic Models 


Deterministic models impose a smoothness constraint on the 2D-motion field, 
which requires that motion vectors vary slowly from pixel to pixel or block to block 
over a spatio-temporal neighborhood. They may take several forms: i) In differential 
methods, imposing a global or local smoothness constraint on the solution of the 
optical flow equation requires solution of a variational problem [Hor 81] (Section 
4.4). Because a global smoothness constraint causes inaccurate motion estimation 
at motion/occlusion boundaries, more advanced directional smoothness constraints 
that allow for sudden discontinuities in the motion field have also been proposed 
[Nag 86]. ii) In the block-matching method (Section 4.5), commonly used in video- 
compression standards, the search can be initialized at a pixel pointed to by the estimate
from one of the neighboring blocks. iii) Pel-recursive methods are predictor-corrector
type displacement estimators (Section 4.6), where the prediction at each pixel can 
be taken as the motion estimate at the previous pixel or as a linear combination of 
estimates in a neighborhood of the current pixel. Hence, the prediction step can be 
considered as an implicit smoothness constraint. Wiener-type estimation extends 
this concept to block-based recursive-motion estimation. 


Probabilistic Smoothness Constraints 


Bayesian motion-estimation methods utilize probabilistic smoothness constraints,
usually in the form of a Gibbs random field, where smoothness of the displacement
field is quantified in terms of some energy functions (see Appendix B). It is
also possible to impose directional smoothness constraints within this framework by
defining line fields [Dub 93]. The main drawback of such methods is the extensive
amount of computation that is required.


4.3 2D Apparent-Motion Estimation 


While it is the projected “true” motion that is desired in most computer-vision prob- 
lems, estimation of apparent motion is often sufficient for most video-processing 
applications such as motion-compensated filtering and video compression. This sec- 
tion first defines various formulations of the 2D apparent-motion estimation prob- 
lem and clarifies the distinction between them in Section 4.3.1. Motion estimation 
from one or more frames relies on the principle (assumption) that image intensity 
remains constant along the true-motion path (trajectory). This constraint is math- 
ematically stated in the form of the optical flow equation (OFE) in Section 4.3.2 
and in the form of displaced frame difference (DFD) in Section 4.3.3. Section 4.3.4 
discusses ambiguities in pixel-based estimation of 2D apparent motion. The concept 
of hierarchical-motion estimation is introduced in Section 4.3.5. Finally, Section 
4.3.6 presents measures to assess motion estimation performance. 


4.3.1 Sparse Correspondence, Optical-Flow Estimation, 
and Image-Registration Problems 


The 2D “apparent motion” estimation problem can be posed as a sparse-correspondence
estimation, dense displacement/velocity (optical-flow) estimation, or global
image-registration problem.


Sparse-Correspondence Estimation 


The displacement of an image feature or pixel from image coordinate x at time t to
x′ at time t′ results in a displacement (correspondence) vector d(x, t) = x′ − x. The
sparse-correspondence estimation problem can be posed as finding the displacement
vectors d(x, t) between two or more frames at some pre-determined isolated “good”
feature points $\mathbf{x}_j$, j = 1, ..., N. Good feature points are typically corner points or pixels
of interest that can be uniquely matched in two or more frames. That is, there is a
sufficient image gradient in their neighborhood and they are visible in the frames of 
interest (see aperture and occlusion problems in Section 4.3.4). Detection of “good” 
feature points, with sufficient image gradient in their local neighborhood, has been 
discussed in Section 3.3.4. Sparse feature correspondence estimation is usually the 
first step in 3D motion and structure estimation (see Section 4.8).


Dense-Motion (Optical-Flow/Displacement) Estimation


Apparent flow of image intensity pattern on a lattice of pixels $(\mathbf{x}, t) \in \Lambda^3$ is called
optical flow. Optical flow field is the set of instantaneous velocity vectors
$\mathbf{v}(\mathbf{x}, t) = [v_1(\mathbf{x}, t), v_2(\mathbf{x}, t)]^T = [dx_1/dt,\ dx_2/dt]^T$ for all $(\mathbf{x}, t) \in \Lambda^3$. The dense correspondence
(displacement) field is the set of 2D displacement vectors $\mathbf{d}(\mathbf{x}, t)$, for all $(\mathbf{x}, t) \in \Lambda^3$.
The correspondence and optical flow vectors usually vary from pixel to pixel (space-
varying motion), e.g., due to rotation of objects.

The 2D dense-motion estimation problem can be posed as estimation of either:


1. correspondence vectors $\mathbf{d}(\mathbf{x}, t) = [d_1(\mathbf{x}, t), d_2(\mathbf{x}, t)]^T$ for all $(\mathbf{x}, t) \in \Lambda^3$ or
2. optical-flow vectors $\mathbf{v}(\mathbf{x}, t) = [v_1(\mathbf{x}, t), v_2(\mathbf{x}, t)]^T$ for all $(\mathbf{x}, t) \in \Lambda^3$.


The lattice $\Lambda^3$ may consist of all pixels or a subset of them (as in block-based motion
estimation used in video-compression standards). 


Dense-Correspondence Problem 


The dense-correspondence problem can be set up as a forward- or backward-motion 
estimation problem, as depicted in Figure 4.9, depending on whether the motion
vector is defined from $t$ to $t + l\Delta t$ or from $t$ to $t - l\Delta t$, where $l$ is the frame counter
and $\Delta t$ is the temporal sampling interval.


Forward Estimation Given two video frames at $t$ and $t + l\Delta t$ that are related by

$$s_c(x_1, x_2, t) = s_c(x_1 + d_1(\mathbf{x}),\ x_2 + d_2(\mathbf{x}),\ t + l\Delta t) \tag{4.21}$$

where the temporal argument of $\mathbf{d}(\mathbf{x})$ is dropped for ease of notation, or equivalently,

$$s_k(x_1, x_2) = s_{k+l}(x_1 + d_1(\mathbf{x}),\ x_2 + d_2(\mathbf{x}))$$

such that $t = k\Delta t$, find correspondence vectors $\mathbf{d}(\mathbf{x}) = [d_1(\mathbf{x}), d_2(\mathbf{x})]^T$ for all $(\mathbf{x}, t) \in \Lambda^3$.


Figure 4.9 Forward- and backward-correspondence estimation.


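Backward Estimation If we define correspondence vectors from time $t$ to $t - l\Delta t$,
then

$$s_k(x_1, x_2) = s_{k-l}(x_1 + d_1(\mathbf{x}),\ x_2 + d_2(\mathbf{x}))$$

such that $t = k\Delta t$. Alternatively, the motion vector can be defined from time $t - l\Delta t$
to $t$. Then,

$$s_k(x_1, x_2) = s_{k-l}(x_1 - d_1(\mathbf{x}),\ x_2 - d_2(\mathbf{x}))$$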


In predictive video compression, backward-motion estimation is used in forward 
(P-mode) causal-motion compensation. Because x + d(x) does not generally corre- 
spond to a lattice site (integer pixel), the right-hand sides of these expressions must 
be evaluated using some sub-pixel interpolation scheme. The dense correspondence 
problem also arises in stereo-disparity estimation, where we have a left-right pair 
instead of a temporal pair of images. 


Optical-Flow Estimation Problem 


The optical-flow estimation problem can be posed as: given samples of $s_c(x_1, x_2, t)$
on a 3D lattice $\Lambda^3$, determine the 2D instantaneous velocity $\mathbf{v}(\mathbf{x}, t)$ for all pixels
$(\mathbf{x}, t) \in \Lambda^3$.

Theoretically, continuous spatio-temporal intensity pattern $s_c(\mathbf{x}, t)$ is required
to determine the optical-flow field, since we need to analyze spatial and temporal
variations in the continuous spatio-temporal intensity pattern by computing
partial derivatives. However, in practice, we estimate the optical-flow field from
the available video data, which is spatially and temporally discrete intensity on a
spatio-temporal lattice $\Lambda^3$. Therefore, accurate estimation of spatial/temporal partial
derivatives from discrete intensity data (discussed in Chapter 3) plays an important
role in the precision of optical-flow estimates.

We can make the following observations: i) correspondence vectors converge to the
optical-flow vectors in the limit as $\Delta t = t' - t$ goes to zero; ii) estimation of optical flow
and correspondence vectors from two frames are equivalent, with $\mathbf{d}(\mathbf{x}, t) = \mathbf{v}(\mathbf{x}, t)\,\Delta t$,
provided that the velocity remains constant during each time interval $\Delta t$; and iii) we
need to consider more than two frames at a time to estimate optical flow in the presence
of acceleration.
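
These observations can be illustrated with a one-dimensional Python sketch. The trajectory model (constant acceleration) and all numeric values are assumptions chosen for the example; the sketch compares the displacement divided by the time interval with the instantaneous velocity.

```python
def position(t, x0, v0, a):
    """Position along a hypothetical 1D trajectory with initial velocity v0 and acceleration a."""
    return x0 + v0 * t + 0.5 * a * t * t


def displacement(t, dt, x0, v0, a):
    """Correspondence (displacement) between times t and t + dt."""
    return position(t + dt, x0, v0, a) - position(t, x0, v0, a)


def velocity(t, v0, a):
    """Instantaneous velocity (optical flow) at time t."""
    return v0 + a * t


x0, v0, t = 0.0, 2.0, 1.0

# Constant velocity (a = 0): displacement equals velocity times the interval.
d_const = displacement(t, 0.5, x0, v0, 0.0)

# With acceleration, displacement / dt approaches the velocity only as dt shrinks.
a = 4.0
errors = [abs(displacement(t, dt, x0, v0, a) / dt - velocity(t, v0, a))
          for dt in (1.0, 0.1, 0.01)]
print(d_const, errors)
```

With acceleration present, two frames give only the average velocity over the interval, which is why more than two frames are needed to estimate the flow in that case.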
Image-Registration Problem


Image registration is a special case of the correspondence problem, where a global 
parametric mapping or a collection of local parametric mappings can be defined 
between pairs of frames to register them on a common reference frame. For example, 
multiple exposures of a static scene taken by a panning camera can be registered by 
one of the parametric models in Section 4.2.3. 


Mosaic Representation (Image Stitching) 


Because individual images in a camera-pan sequence have varying field of view with 
some overlap between them, a panoramic field of view can be obtained by stitch- 
ing them together with proper blending of intensity of pixels that are registered on 
a single reference frame called a photo-mosaic [Ira 96, Sze 06]. Most recent digital 
cameras provide this functionality as a single-button option. 


4.3.2 Optical-Flow Equation and Normal Flow 


The fundamental principle of motion estimation is that the intensity of a pixel 
remains constant along the motion trajectory, which can be expressed in the form 
of the OFE. Assuming that space and time are represented by continuous variables, 
intensity constancy implies that the rate of change of intensity along the motion 
trajectory is zero, expressed as 


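$$\frac{d s_c(x_1, x_2, t)}{dt} = 0 \tag{4.22}$$

This is a total derivative expression since $x_1$ and $x_2$ vary with $t$ along the motion
trajectory. Using the chain rule of differentiation, we have

$$\frac{\partial s_c(x_1, x_2, t)}{\partial x_1} \frac{\partial x_1}{\partial t} + \frac{\partial s_c(x_1, x_2, t)}{\partial x_2} \frac{\partial x_2}{\partial t} + \frac{\partial s_c(x_1, x_2, t)}{\partial t} = 0 \tag{4.23}$$

where $v_1(\mathbf{x}, t) = \partial x_1/\partial t$ and $v_2(\mathbf{x}, t) = \partial x_2/\partial t$ denote the components of the coordinate
velocity vector in terms of the continuous spatial coordinates. This is known
as the OFE or the optical-flow constraint, which can alternatively be expressed in
vector form as

$$\langle \nabla s_c(\mathbf{x}, t),\ \mathbf{v}(\mathbf{x}, t) \rangle + \frac{\partial s_c(\mathbf{x}, t)}{\partial t} = 0 \tag{4.24}$$

where $\nabla s_c(\mathbf{x}, t) = \left[ \dfrac{\partial s_c(\mathbf{x}, t)}{\partial x_1}\ \ \dfrac{\partial s_c(\mathbf{x}, t)}{\partial x_2} \right]^T$ and $\langle \cdot, \cdot \rangle$ denotes the vector inner product.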








Can the OFE Uniquely Specify the Motion Field? 


The OFE (4.24) is not sufficient to uniquely specify the 2D velocity (flow) field,
since it yields one scalar equation in two unknowns, $v_1(\mathbf{x}, t)$ and $v_2(\mathbf{x}, t)$, at each pixel
site $(\mathbf{x}, t)$. Inspection of (4.24) reveals that we can only estimate the component of
the flow vector that is in the direction of the spatial-image gradient $\nabla s_c(\mathbf{x}, t)$, called
the normal flow $\mathbf{v}_\perp(\mathbf{x}, t)$, because the component that is orthogonal to the spatial-image
gradient disappears under the dot product. The concept of normal flow is
illustrated in Figure 4.10, where all vectors whose tips lie on the dotted line satisfy
Eqn. (4.24). The optical flow equation (4.24) can be rewritten as

$$\|\nabla s_c(\mathbf{x}, t)\|\, \|\mathbf{v}(\mathbf{x}, t)\| \cos\beta + \frac{\partial s_c(\mathbf{x}, t)}{\partial t} = 0$$

where $\beta$ is the angle between the vectors $\nabla s_c(\mathbf{x}, t)$ and $\mathbf{v}(\mathbf{x}, t)$. Then, the magnitude of
normal flow $\|\mathbf{v}_\perp(\mathbf{x}, t)\|$ at each site can be computed by setting the angle $\beta = 0$

$$\|\mathbf{v}_\perp(\mathbf{x}, t)\| = \frac{\left| \partial s_c(\mathbf{x}, t)/\partial t \right|}{\|\nabla s_c(\mathbf{x}, t)\|} \tag{4.25}$$


Figure 4.10 Normal flow. All vectors whose tips lie on the dotted line satisfy (4.24).


Thus, without additional motion-field modeling or assumptions, we can only
determine motion that is parallel to the spatial-image gradient vector (orthogonal to
the edge), called the normal flow, at each pixel. Observe that the OFE requires that
i) the spatio-temporal image intensity be differentiable, and ii) the partial derivatives
of the intensity be estimated.
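
The following Python sketch computes the normal flow at a single pixel from given derivative values and shows that other velocity vectors also satisfy the OFE. The derivative values and the true velocity are hypothetical, and the function names are illustrative.

```python
def normal_flow(s_x1, s_x2, s_t, eps=1e-12):
    """Normal-flow vector from the spatial gradient (s_x1, s_x2) and the temporal
    derivative s_t. Returns None where the gradient vanishes (motion unobservable)."""
    grad_sq = s_x1 * s_x1 + s_x2 * s_x2
    if grad_sq < eps:
        return None
    scale = -s_t / grad_sq
    return (scale * s_x1, scale * s_x2)


def ofe_residual(s_x1, s_x2, s_t, v):
    """Left-hand side of the optical-flow equation for a candidate velocity v."""
    return s_x1 * v[0] + s_x2 * v[1] + s_t


# Hypothetical pixel on a vertical edge (gradient along x1), true velocity (1.0, 0.5).
s_x1, s_x2 = 2.0, 0.0
v_true = (1.0, 0.5)
s_t = -(s_x1 * v_true[0] + s_x2 * v_true[1])   # temporal derivative implied by the OFE

v_n = normal_flow(s_x1, s_x2, s_t)
# Adding any multiple of the edge direction (-s_x2, s_x1) still satisfies the OFE.
v_other = (v_n[0] - 3.0 * s_x2, v_n[1] + 3.0 * s_x1)
residuals = [ofe_residual(s_x1, s_x2, s_t, v) for v in (v_true, v_n, v_other)]
print(v_n, residuals)
```

Only the component of the true velocity along the gradient is recovered; the component along the edge is lost, which is the aperture problem at the level of a single pixel.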


4.3.3 Displaced-Frame Difference 


The displaced-frame difference (DFD) equation expresses the main principle of
motion estimation, namely that the intensity of a pixel remains constant along the
motion trajectory, in discrete spatial and temporal notation. In the case of forward-motion
estimation, the DFD between time instances $t$ and $t' = t + \Delta t$ is defined by

$$\mathrm{DFD}(\mathbf{x},\mathbf{d}) = s_c(\mathbf{x}+\mathbf{d}(\mathbf{x}),\, t+\Delta t) - s_c(\mathbf{x},t) \tag{4.26}$$

where $s_c(\mathbf{x}, t)$ is the video and $\mathbf{d}(\mathbf{x}) = [d_1(\mathbf{x})\ d_2(\mathbf{x})]^T$ denotes the motion vector (MV)
field between times $t$ and $t + \Delta t$. We observe that i) since the components of $\mathbf{d}(\mathbf{x})$ are
allowed to take non-integer values, interpolation is required to compute the DFD,
and ii) if $\mathbf{d}(\mathbf{x})$ is equal to the true MV and there is no interpolation error, the DFD
attains the value zero.
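The two observations above can be made concrete with a minimal sketch that evaluates (4.26) at one pixel using bilinear interpolation. Bilinear interpolation, border clamping, the `frame[x2][x1]` indexing, and the names `bilinear` and `dfd` are all assumptions chosen for illustration; the text does not prescribe an interpolator.

```python
import math

def bilinear(frame, x1, x2):
    """Sample a frame (list of rows, indexed frame[x2][x1]) at a real-valued
    position, clamping coordinates to the frame borders."""
    h, w = len(frame), len(frame[0])
    x1 = min(max(x1, 0.0), w - 1.0)
    x2 = min(max(x2, 0.0), h - 1.0)
    c0, r0 = int(math.floor(x1)), int(math.floor(x2))
    c1, r1 = min(c0 + 1, w - 1), min(r0 + 1, h - 1)
    a, b = x1 - c0, x2 - r0
    top = (1 - a) * frame[r0][c0] + a * frame[r0][c1]
    bot = (1 - a) * frame[r1][c0] + a * frame[r1][c1]
    return (1 - b) * top + b * bot

def dfd(cur, nxt, x1, x2, d1, d2):
    """DFD at integer pixel (x1, x2) for a candidate displacement (d1, d2):
    intensity of the next frame at x + d minus intensity of the current frame at x."""
    return bilinear(nxt, x1 + d1, x2 + d2) - cur[x2][x1]

# A linear ramp translated by half a pixel in x1 between the two frames.
cur = [[2.0 * c + 3.0 * r for c in range(5)] for r in range(5)]
nxt = [[2.0 * (c - 0.5) + 3.0 * r for c in range(5)] for r in range(5)]
print(dfd(cur, nxt, 2, 2, 0.5, 0.0), dfd(cur, nxt, 2, 2, 0.0, 0.0))
```

On this ramp, bilinear interpolation is exact, so the DFD vanishes at the true displacement and is non-zero for the zero-displacement candidate.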

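We can expand $s_c(\mathbf{x} + \mathbf{d}(\mathbf{x}), t + \Delta t)$ into a Taylor series about $(\mathbf{x}, t)$, for small $\mathbf{d}(\mathbf{x})$
and $\Delta t$, as

$$s_c(x_1 + d_1(\mathbf{x}),\, x_2 + d_2(\mathbf{x}),\, t + \Delta t) = s_c(\mathbf{x},t) + d_1(\mathbf{x})\frac{\partial s_c(\mathbf{x},t)}{\partial x_1} + d_2(\mathbf{x})\frac{\partial s_c(\mathbf{x},t)}{\partial x_2} + \Delta t\,\frac{\partial s_c(\mathbf{x},t)}{\partial t} + \text{h.o.t.} \tag{4.27}$$

Substituting (4.27) into (4.26), and neglecting the higher-order terms (h.o.t.),

$$\mathrm{DFD}(\mathbf{x},\mathbf{d}) = \frac{\partial s_c(\mathbf{x},t)}{\partial x_1}\,d_1(\mathbf{x}) + \frac{\partial s_c(\mathbf{x},t)}{\partial x_2}\,d_2(\mathbf{x}) + \frac{\partial s_c(\mathbf{x},t)}{\partial t}\,\Delta t \tag{4.28}$$

We investigate the relationship between the DFD and OFE in two cases.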


Case 1 


Limit $\Delta t \to 0$: Setting $\mathrm{DFD}(\mathbf{x}, \mathbf{d}) = 0$, dividing both sides of (4.28) by $\Delta t$, and taking
the limit as $\Delta t$ approaches 0, we obtain the OFE

$$\frac{\partial s_c(\mathbf{x},t)}{\partial x_1}\,v_1(\mathbf{x},t) + \frac{\partial s_c(\mathbf{x},t)}{\partial x_2}\,v_2(\mathbf{x},t) + \frac{\partial s_c(\mathbf{x},t)}{\partial t} = 0$$

where $\mathbf{v}(\mathbf{x}, t) = [v_1(\mathbf{x}, t)\ v_2(\mathbf{x}, t)]^T$ denotes the velocity vector at time $t$. We see that
setting the DFD equal to zero (or minimizing the DFD) is equivalent to imposing
the OFE (4.24) in the limit as $\Delta t \to 0$.


Case 2 


For $\Delta t$ finite: An estimate of the displacement vector $\mathbf{d}(\mathbf{x})$ between any two frames
that are $\Delta t$ apart can be obtained from (4.26) in a number of ways:


1. Search for $\mathbf{d}(\mathbf{x})$, which would set the left-hand side of Eqn. (4.26) to zero over a
block of pixels (block-matching strategy).

2. Compute $\mathbf{d}(\mathbf{x})$, which would set the left-hand side of (4.26) or (4.28) to zero
on a pixel-by-pixel basis using a gradient-based optimization scheme (pel-recursive
strategy).

3. Set $\Delta t = 1$ and $\mathrm{DFD}(\mathbf{x}, \mathbf{d}(\mathbf{x})) = 0$; solve for $\mathbf{d}(\mathbf{x})$ using a set of linear equations
obtained from the right-hand side of (4.28) using a block of pixels.


All three approaches can be shown to be identical if i) local variation of the
spatio-temporal image intensity is linear, and ii) velocity is constant within the time
interval $\Delta t$, i.e.,

$$d_1(\mathbf{x}) = v_1(\mathbf{x},t)\,\Delta t \quad \text{and} \quad d_2(\mathbf{x}) = v_2(\mathbf{x},t)\,\Delta t$$


In practice, DFD(x,d) hardly ever becomes exactly zero for any value of d, 
because: i) there is observation noise, ii) there is occlusion (covered/uncovered 
regions), iii) there are interpolation errors for non-integer MV, and iv) scene illu- 
mination may vary frame to frame. Therefore, we minimize the absolute value or 
square of the DFD or the left-hand side of the OFE over a block of pixels to estimate 
the 2D-motion field. Pel-recursive methods (see Section 4.6) employ gradient-based 
optimization to minimize the square of the DFD with an implicit smoothness con- 
straint (as opposed to search methods used in block-matching). 


4.3.4 Motion Estimation is Ill-Posed: Occlusion
and Aperture Problems


2D-motion estimation, posed as either a correspondence or optical-flow estimation
problem, based on two frames, is an "ill-posed" problem in the absence of any additional
assumptions about the nature of the motion. A problem is called ill-posed if
a unique solution does not exist, and/or the solution(s) do(es) not continuously depend
on the data [Ber 88]. 2D-motion estimation suffers from all three of the existence, uniqueness,
and continuity problems:


• Existence of a solution: No correspondence can be established for covered/
uncovered background points. This is known as the occlusion problem.

• Uniqueness of the solution: Treating the $x_1$ and $x_2$ components of displacement (or
velocity) at each pixel as independent variables, the number of unknowns is
twice the number of equations (the frame difference at each pixel). Hence, the
motion-estimation problem is under-determined.

• Continuity of the solution: A small amount of observation noise in video
frames may result in a large deviation in the motion estimates, i.e., the estimate
is highly sensitive to noise.


Therefore, the motion-estimation problem must be regularized by using motion 
models and/or priors. 


Occlusion Problem 


Occlusion refers to covering/uncovering part of an object or background from frame 
to frame due to motion of an object with respect to the camera, such that some pixels 
in the current frame do not have a correspondence in the reference frame. There are 
two sources of occlusion: 


1. mutual occlusion, where a moving object covers another object or part of the
background (see Figure 4.11), and

2. self-occlusion, where, for example, an object rotates clockwise out of plane, so that the
left edge gets covered and some new texture is uncovered from the right edge.


[Figure 4.11: frames $k$ and $k+1$. In frame $k$, the dotted region is the background to be
covered (no region in the next frame matches this region); in frame $k+1$, the dotted region is
the uncovered background (no motion vector originating in frame $k$ points into this region).]

Figure 4.11 Covered/uncovered background problem.

Covered and uncovered background concepts are illustrated in Figure 4.11,
where an object shown by solid lines translates in the $x_1$ direction from frame
$k$ to $k+1$. The dotted region in frame $k$ indicates the background to be covered in
frame $k+1$. Thus, it is not possible to find a correspondence for these pixels in frame
$k+1$. The dotted region in frame $k+1$ indicates the background that is uncovered
by the motion of the object. There is no correspondence for these pixels in frame
$k$. Note that the roles of covered and uncovered regions are interchanged according
to the direction of motion estimation. Uncovered regions in backward-correspondence
estimation become covered regions in forward estimation, and vice versa.


Aperture Problem 


The aperture problem is a restatement of the fact that the solution to the 2D motion- 
estimation problem is not unique. There are many pixels or possibly blocks that are 
similar to the current pixel or block in the reference picture. Technically speaking, 
the number of equations (OFE or DFD) is equal to the number of pixels, but the 
MV has two components for each pixel, and the number of unknowns is twice that 
of equations. Hence, we can only determine the normal flow. This problem may be 
overcome if we assume all pixels within a block (at least two pixels) have a common 
MV. Given a block (aperture) to estimate an MV, there are three possible cases: 


• Case 1: There are two linearly independent intensity-gradient vectors within the
aperture. Both $v_1$ and $v_2$ can be estimated uniquely.

• Case 2: There is only one intensity-gradient direction within the aperture. Only
a normal flow (motion) vector can be estimated for the block uniquely.

• Case 3: There is no intensity gradient within the aperture. The motion does
not result in observable temporal intensity variation, hence there are infinitely
many solutions.
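One way to tell these three cases apart numerically is to look at the eigenvalues of the 2 x 2 matrix formed by summing the outer products of the gradient vectors in the aperture. The sketch below is only an illustration under that assumption: the function name `aperture_case` and the tolerance `tol` are hypothetical, and a practical threshold would depend on image noise.

```python
import math

def aperture_case(gradients, tol=1e-9):
    """Classify an aperture from its (sx1, sx2) gradient samples.

    Returns 1 (two independent gradient directions), 2 (one direction),
    or 3 (no gradient). tol is an assumed numerical tolerance.
    """
    a = sum(g1 * g1 for g1, _ in gradients)
    b = sum(g1 * g2 for g1, g2 in gradients)
    c = sum(g2 * g2 for _, g2 in gradients)
    # Eigenvalues of the symmetric matrix [[a, b], [b, c]].
    mean = 0.5 * (a + c)
    root = math.sqrt(0.25 * (a - c) ** 2 + b * b)
    lam_max, lam_min = mean + root, mean - root
    if lam_max <= tol:
        return 3
    if lam_min <= tol * lam_max:
        return 2
    return 1

print(aperture_case([(1.0, 0.0), (0.0, 1.0)]))   # corner-like block
print(aperture_case([(1.0, 1.0), (2.0, 2.0)]))   # single edge direction
print(aperture_case([(0.0, 0.0), (0.0, 0.0)]))   # flat block
```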


The aperture problem is illustrated in Figure 4.12. Suppose we have a corner of
an object moving in the $x_2$ direction (upward). If we estimate the motion based on a
local window, indicated by Aperture 1, then it is not possible to determine whether
the image moves upward or perpendicular to the edge. Recall that we have shown
that the OFE only determines the component of the motion in the direction perpendicular
to the edge, called the normal flow.

If we observe Aperture 2, then it is possible to estimate the correct MV, since the
image has a gradient in two perpendicular directions. Thus, it is possible to estimate
the MV uniquely based on a block of pixels that contains sufficient gray-level variation
(gradient) [Hil 84]. Implicit in this discussion is the model that all pixels in the
block translate by the same MV.

[Figure 4.12: a moving corner with two local windows, Aperture 1 (on a straight edge, where
only the normal flow is observable) and Aperture 2 (on the corner).]

Figure 4.12 Aperture problem.


The aperture and occlusion problems can be alleviated by employing motion 
models. Motion can be represented by global models, block models, or dense 
models. Global and block models can be classified as parametric models, whereas 
dense models are generally non-parametric. 


4.3.5 Hierarchical Motion Estimation 


The basic idea is to perform motion estimation successively at different levels of 
the resolution hierarchy, using a coarse-to-fine strategy based on a multi-resolution 
representation of each frame such as the Gaussian pyramid constructed by repeated 
blurring and down-sampling (see Section 3.2.3). The Gaussian pyramid representa- 
tions of the current and reference frames are depicted in Figure 4.13. Optical-flow 
(displacement) vectors are first computed on the top level (coarsest level with the 
least number of pixels) and then up-sampled and used to initialize the estimate at 
the next level. The lower resolution levels serve to determine rough estimates of the 
displacement that are successively refined at higher resolution levels. 
Hierarchical motion estimation has multiple benefits: 


1. It is effective in dealing with large motion vectors. At the coarsest level, motion 
vectors are smaller, helping with the linearization of the OFE. 

2. It helps to alleviate the aperture problem, since equal size blocks cover a larger 
image area at the upper levels of the pyramid, hence reducing the chance of 
singular image blocks. 

3. It helps to reduce the computational complexity, especially in search-based
methods. Computation at the higher levels in the pyramid involves fewer
pixels, hence is faster. The initialization at each level from the previous level
means that a smaller search range is needed at higher resolution levels and/or fewer
iterations are required at each level. Hence, hierarchical methods are faster than
single-level methods.

[Figure 4.13: pyramid levels, with an arrow indicating increasing resolution.]

Figure 4.13 Pyramid representation of each frame.


While hierarchical implementations have many benefits as discussed above, they
may fail to follow small objects with fast motion. Hierarchical implementations of
differential and matching-based motion-estimation methods are discussed in Section
4.4 and Section 4.5, respectively.
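The coarse-to-fine bookkeeping can be sketched in a few lines. To stay short, the sketch below replaces Gaussian blurring with a plain 2 x 2 box average before down-sampling; that substitution, and the names `downsample`, `build_pyramid`, and `propagate`, are assumptions of this example rather than the construction of Section 3.2.3.

```python
def downsample(frame):
    """Halve the resolution by averaging non-overlapping 2x2 blocks
    (a box filter standing in for Gaussian blurring)."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [[(frame[2 * r][2 * c] + frame[2 * r][2 * c + 1] +
              frame[2 * r + 1][2 * c] + frame[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)] for r in range(h)]

def build_pyramid(frame, levels):
    """Return the list [finest, ..., coarsest] with the requested number of levels."""
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

def propagate(mv):
    """Scale an MV estimated at one level to initialize the next finer level,
    where the pixel grid is twice as dense."""
    return (2 * mv[0], 2 * mv[1])

frame = [[r * 4 + c for c in range(4)] for r in range(4)]
pyr = build_pyramid(frame, 3)
print([len(level) for level in pyr], propagate((1, -2)))
```

An estimator would start at `pyr[-1]`, then call `propagate` on its result before refining at each finer level.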


4.3.6 Performance Measures for Motion Estimation 


We can classify error measures for motion estimation as i) those that require ground- 
truth (GT) motion vectors, such as the norm of the error (NE), also called the end- 
point error [Bak 11], and the angular error (AE); and ii) those that don’t, such as the 
motion-compensation error (MCE). 

Given an estimated motion vector $(d_{1,i}, d_{2,i})$ at a pixel $\mathbf{x}_i$ and the corresponding
GT motion vector $(d_{1,i}^{GT}, d_{2,i}^{GT})$, the norm of the error can be computed by

$$\|\mathbf{e}_i\| = \sqrt{\left(d_{1,i} - d_{1,i}^{GT}\right)^2 + \left(d_{2,i} - d_{2,i}^{GT}\right)^2} \tag{4.29a}$$

The angular error between an estimated motion vector $(d_{1,i}, d_{2,i})$ at a pixel $\mathbf{x}_i$ and
the corresponding GT motion vector $(d_{1,i}^{GT}, d_{2,i}^{GT})$ is the angle $\theta_i$ between the 3D vectors
$\mathbf{d}_i = (d_{1,i}, d_{2,i}, 1.0)$ and $\mathbf{d}_i^{GT} = (d_{1,i}^{GT}, d_{2,i}^{GT}, 1.0)$, which can be computed by using the
dot (inner) product rule

$$\cos\theta_i = \frac{\mathbf{d}_i \cdot \mathbf{d}_i^{GT}}{\|\mathbf{d}_i\|\,\|\mathbf{d}_i^{GT}\|} \tag{4.29b}$$

This measure was first proposed by Fleet and Jepson [Fle 90] and has been used
in the early comparative study of motion-estimation methods [Bar 94]. Note that
the AE penalizes errors in large flows less than errors in small flows.

Motion-compensation error, which requires no ground-truth MVs, can be computed
as the mean-square error between the actual current image $s_k(x_1, x_2)$ and its
predicted version from the reference image $s_{k-1}(x_1, x_2)$ using the estimated motion
vectors $(d_1, d_2)$, given by

$$\mathrm{MSE} = \sum_{x_1}\sum_{x_2}\left[s_k(x_1,x_2) - s_{k-1}(x_1 - d_1,\, x_2 - d_2)\right]^2 \tag{4.29c}$$


The statistics of the pixel-based measures NE $\|\mathbf{e}_i\|$ and AE $\theta_i$, including averages,
standard deviations, and the percentage of pixels that have an error measure above a
value X, as well as the motion-compensation error, have been published to compare
recent motion-estimation methods using the Middlebury optical-flow dataset [Bak 11].
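The two ground-truth-based measures, (4.29a) and (4.29b), are easy to check numerically. The following sketch uses hypothetical function names (`endpoint_error`, `angular_error`) and returns the angle in radians; the clamping of the cosine is a numerical safeguard added for this example.

```python
import math

def endpoint_error(d, d_gt):
    """Norm of the error between an estimated MV d and the ground-truth MV d_gt."""
    return math.hypot(d[0] - d_gt[0], d[1] - d_gt[1])

def angular_error(d, d_gt):
    """Angle (radians) between the 3D vectors (d1, d2, 1) and (d1_gt, d2_gt, 1)."""
    dot = d[0] * d_gt[0] + d[1] * d_gt[1] + 1.0
    n_est = math.sqrt(d[0] ** 2 + d[1] ** 2 + 1.0)
    n_gt = math.sqrt(d_gt[0] ** 2 + d_gt[1] ** 2 + 1.0)
    # Clamp to [-1, 1] so rounding cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot / (n_est * n_gt))))

# The same one-pixel endpoint error on a small flow and on a large flow.
small = angular_error((0.0, 0.0), (1.0, 0.0))
large = angular_error((10.0, 0.0), (11.0, 0.0))
print(endpoint_error((3.0, 4.0), (0.0, 0.0)), small, large)
```

Comparing `small` and `large` illustrates the remark that the AE penalizes a given error less when the flow itself is large.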


4.4 Differential Methods 


Motion-estimation methods that utilize the spatial and temporal partial derivatives
of images or the OFE are called differential or direct methods. Recall that the estimation
of the image gradient has been discussed in Section 3.3. The OFE alone is
not sufficient to determine MVs at each pixel since it specifies one equation in two
unknowns per pixel (Section 4.3.2). In order to regularize this underdetermined
estimation problem, we employ either parametric or non-parametric motion models;
the resulting approaches are generally known as the Lucas-Kanade method (Section 4.4.1)
and the Horn-Schunck method (Section 4.4.2), respectively.


4.4.1 Lucas-Kanade Method


The Lucas-Kanade method [Luc 81] is one of the most popular 2D-motion estimation
methods in digital-video processing. Its many extensions include hierarchical
model-based motion estimation [Ber 92] and the forward-additive (original),
forward-compositional, inverse-additive, and inverse-compositional formulations,
which have been shown to be equivalent [Bak 04]. Among these, the inverse-compositional
algorithm is computationally the most efficient and can be used for direct
estimation of most parametric models including homography.

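Here, we present the original formulation, where incremental additive parameters
are estimated, and we minimize the error in warping the current frame toward
the previous frame, given by

$$E(\mathbf{p}) = \sum_{\mathbf{x}\in B}\left[s_k(\mathbf{x}') - s_{k-1}(\mathbf{x})\right]^2 \tag{4.30a}$$

where $B$ denotes an $N \times N$ block of pixels with sufficient gray-level variation and
$\mathbf{x}' = T(\mathbf{x}; \mathbf{p}) = [T_1(\mathbf{x}; \mathbf{p})\ T_2(\mathbf{x}; \mathbf{p})]^T$ denotes a parametric-motion model as a function
of the model parameter vector $\mathbf{p}$. This is a nonlinear optimization problem, which
can be approximated by a quadratic cost function by replacing $s_k(\mathbf{x}')$ by its Taylor
series expansion given by Eqn. (4.27) assuming small motion. In order to ensure
the motion is small, we assume a current estimate of the parameter vector $\mathbf{p}$ is available
and consider estimation of differential motion due to an incremental parameter
update $\Delta\mathbf{p}$; i.e., we expand $s_k(\mathbf{x}')$, where $\mathbf{x}' = T(\mathbf{x}; \mathbf{p} + \Delta\mathbf{p})$, into a Taylor series about
the point $T(\mathbf{x}; \mathbf{p})$. The resulting cost function that is quadratic in $\Delta\mathbf{p}$ is given by

$$\sum_{\mathbf{x}\in B}\left[s_k(T(\mathbf{x};\mathbf{p})) + \nabla s_k(T(\mathbf{x};\mathbf{p}))\,\frac{\partial T}{\partial \mathbf{p}}\,\Delta\mathbf{p} - s_{k-1}(\mathbf{x})\right]^2 \tag{4.30b}$$

where the term $\partial T / \partial \mathbf{p}$ is the Jacobian of the parametric model. Minimization of (4.30b)
with respect to $\Delta\mathbf{p}$ is a least-squares estimation problem, which has a closed-form
solution. Computing the partial derivative of (4.30b) with respect to $\Delta\mathbf{p}$ and setting it
equal to zero, we obtain

$$\Delta\mathbf{p} = H^{-1}\sum_{\mathbf{x}\in B}\left[\nabla s_k(T(\mathbf{x};\mathbf{p}))\,\frac{\partial T}{\partial \mathbf{p}}\right]^T\left[s_{k-1}(\mathbf{x}) - s_k(T(\mathbf{x};\mathbf{p}))\right] \tag{4.31}$$

where

$$H = \sum_{\mathbf{x}\in B}\left[\nabla s_k(T(\mathbf{x};\mathbf{p}))\,\frac{\partial T}{\partial \mathbf{p}}\right]^T\left[\nabla s_k(T(\mathbf{x};\mathbf{p}))\,\frac{\partial T}{\partial \mathbf{p}}\right]$$

is called the Hessian matrix. These expressions form the basis of the following hierarchical
iterative motion-estimation/refinement algorithm [Ber 92, Ira 93].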


Hierarchical Iterative Refinement 


The hierarchical iterative-refinement method employs Gaussian pyramids for both
frames combined with iterative MV updating within each resolution level to keep
the incremental MV updates small, typically under one pixel. At each resolution
level, the MV is initialized by that of the previous level multiplied by 2. At any
iteration at a level, the incremental MV is estimated using a reference image that is
motion-compensated with the MV from the previous iteration so the incremental
MV gets smaller. The iterative-refinement procedure can be summarized as follows:


1. Estimate spatial/temporal image partials on frame $s_k(\mathbf{x})$ at the lowest resolution.
Note that partials are estimated only once at each resolution level. Set the initial
parameter vector $\mathbf{p} = \mathbf{0}$.

2. Compensate (warp) the current block of frame $s_k(\mathbf{x})$ toward $s_{k-1}(\mathbf{x})$ using the
current estimate $\mathbf{p}$ to obtain $s_k(T(\mathbf{x}; \mathbf{p}))$ by sub-pixel motion compensation.
Estimate $\Delta\mathbf{p}$ using (4.31).

3. Update $\mathbf{p} = \mathbf{p} + \Delta\mathbf{p}$. Repeat steps 2 and 3 a few times.

4. Proceed to the next resolution level until we reach the highest resolution level
in the pyramid. At each resolution level, scale the most recent parameter vector
$\mathbf{p}$, estimate spatial/temporal image partials on frame $s_k(\mathbf{x})$, and go to step 2.

5. At the highest resolution level, repeat steps 2 and 3 until the residual parameter
update $\Delta\mathbf{p}$ converges to zero.


Special Case: Block-Translation Motion Model 


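Here, we work out the scalar equations for the block-translation model,
where the parameter vector $\mathbf{p} = (d_1, d_2)$ consists of two displacement values.
In this case, the Jacobian $\partial T / \partial \mathbf{p}$ is the identity matrix and the above derivation
is equivalent to minimizing the error in the OFE. Let us define the
error in the optical flow equation at pixel $\mathbf{x}$ as a function of the incremental
MV $\Delta\mathbf{p} = (\Delta d_1, \Delta d_2)$ by

$$e(\mathbf{x}, \Delta\mathbf{p}) = \frac{\partial s_c(\mathbf{x},t)}{\partial x_1}\,\Delta d_1 + \frac{\partial s_c(\mathbf{x},t)}{\partial x_2}\,\Delta d_2 + \frac{\partial s_c(\mathbf{x},t)}{\partial t}\,\Delta t$$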


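Note that $e(\mathbf{x}, \Delta\mathbf{p})$ is in general not exactly zero at all pixels $\mathbf{x}$ within a block
$B$, because i) there may be some errors in estimating the partial derivatives
from discrete image samples, ii) there may be multiple motions within a
block $B$, and iii) there may be intensity variations from frame to frame. The
total square error over a block of pixels $B$ is given by the sum of squares of
$e(\mathbf{x}, \Delta\mathbf{p})$, which can be expressed as

$$E(\Delta\mathbf{p}) = \sum_{\mathbf{x}\in B}\left[\frac{\partial s_c(\mathbf{x},t)}{\partial x_1}\,\Delta d_1 + \frac{\partial s_c(\mathbf{x},t)}{\partial x_2}\,\Delta d_2 + \frac{\partial s_c(\mathbf{x},t)}{\partial t}\,\Delta t\right]^2 \tag{4.32}$$

We can minimize the total square error by computing partials of the error
$E(\Delta\mathbf{p})$ with respect to the unknowns $\Delta d_1$ and $\Delta d_2$, respectively, and setting
them equal to zero, which yields two equations, for $i = 1, 2$,

$$\frac{\partial E(\Delta\mathbf{p})}{\partial \Delta d_i} = 2\sum_{\mathbf{x}\in B}\left[\frac{\partial s_c}{\partial x_1}\,\Delta d_1 + \frac{\partial s_c}{\partial x_2}\,\Delta d_2 + \frac{\partial s_c}{\partial t}\,\Delta t\right]\frac{\partial s_c}{\partial x_i} = 0$$

Writing these two equations out, we have

$$\sum_{\mathbf{x}\in B}\left(\frac{\partial s_c}{\partial x_1}\right)^2\Delta d_1 + \sum_{\mathbf{x}\in B}\frac{\partial s_c}{\partial x_1}\frac{\partial s_c}{\partial x_2}\,\Delta d_2 = -\sum_{\mathbf{x}\in B}\frac{\partial s_c}{\partial x_1}\frac{\partial s_c}{\partial t}\,\Delta t$$

$$\sum_{\mathbf{x}\in B}\frac{\partial s_c}{\partial x_1}\frac{\partial s_c}{\partial x_2}\,\Delta d_1 + \sum_{\mathbf{x}\in B}\left(\frac{\partial s_c}{\partial x_2}\right)^2\Delta d_2 = -\sum_{\mathbf{x}\in B}\frac{\partial s_c}{\partial x_2}\frac{\partial s_c}{\partial t}\,\Delta t$$

which can be rewritten in vector-matrix form (by normalizing $\Delta t = 1$) as

$$H\,\Delta\mathbf{p} = \mathbf{b} \tag{4.33a}$$

where

$$H = \begin{bmatrix}\sum_{\mathbf{x}\in B}\left(\frac{\partial s_c}{\partial x_1}\right)^2 & \sum_{\mathbf{x}\in B}\frac{\partial s_c}{\partial x_1}\frac{\partial s_c}{\partial x_2}\\[4pt] \sum_{\mathbf{x}\in B}\frac{\partial s_c}{\partial x_1}\frac{\partial s_c}{\partial x_2} & \sum_{\mathbf{x}\in B}\left(\frac{\partial s_c}{\partial x_2}\right)^2\end{bmatrix},\qquad \mathbf{b} = -\begin{bmatrix}\sum_{\mathbf{x}\in B}\frac{\partial s_c}{\partial x_1}\frac{\partial s_c}{\partial t}\\[4pt] \sum_{\mathbf{x}\in B}\frac{\partial s_c}{\partial x_2}\frac{\partial s_c}{\partial t}\end{bmatrix}$$

Then, the estimate of the incremental parameter vector can be computed from

$$\Delta\mathbf{p} = H^{-1}\,\mathbf{b} \tag{4.33b}$$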


The matrix $H$ needs to be invertible, i.e., rank 2, for a unique solution, which
is satisfied if there is sufficient intensity variation within the block $B$ (see
Case 1 under the aperture problem in Section 4.3.4). The solution (4.33)
is the least-squares solution of $N \times N$ optical flow equations (4.23), one for
each pixel in the block $B$, with the same two unknowns.

The derivation of the Lucas-Kanade solution for the affine model (4.17)
or other parametric models that are linear in the unknown parameters, such
as (4.18), (4.19), and (4.20), is straightforward and follows the same steps.
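For the block-translation case, one incremental update amounts to solving a 2 x 2 linear system built from the partial derivatives in the block. The sketch below assumes the partials have already been estimated and that the time step is normalized to 1; the function name `lk_translation_step` and the singularity threshold are hypothetical choices for this example.

```python
def lk_translation_step(sx1, sx2, st, det_tol=1e-12):
    """One least-squares update for a pure-translation block.

    sx1, sx2, st: equal-length sequences holding the two spatial partials and
    the temporal partial at every pixel of the block (time step normalized to 1).
    Returns (delta_d1, delta_d2), or None when the 2x2 matrix is numerically
    singular (det_tol is an assumed threshold).
    """
    h11 = sum(a * a for a in sx1)
    h12 = sum(a * b for a, b in zip(sx1, sx2))
    h22 = sum(b * b for b in sx2)
    b1 = -sum(a * t for a, t in zip(sx1, st))
    b2 = -sum(b * t for b, t in zip(sx2, st))
    det = h11 * h22 - h12 * h12
    if abs(det) < det_tol:
        return None  # aperture problem: gradients do not span two directions
    # Closed-form inverse of a symmetric 2x2 matrix applied to (b1, b2).
    return ((h22 * b1 - h12 * b2) / det, (h11 * b2 - h12 * b1) / det)

# Synthetic partials that satisfy the OFE exactly for the motion (0.5, -0.25).
gx = [1.0, 0.0, 1.0, 2.0]
gy = [0.0, 1.0, 1.0, -1.0]
gt = [-(a * 0.5 + b * -0.25) for a, b in zip(gx, gy)]
print(lk_translation_step(gx, gy, gt))
```

When every gradient in the block points in the same direction, the determinant vanishes and the function returns `None`, which corresponds to Case 2 of the aperture problem.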


Estimation of Partials 


Assuming we are estimating motion vectors from frame $k$ to frame $k - 1$, the spatial
partials are estimated at frame $k$, $s_k(\mathbf{x})$, using the techniques discussed in Section 3.3.
The temporal partial can be approximated by the frame difference or estimated by
(4.37). Clearly, the accuracy of the motion estimates depends on the accuracy of the
estimated spatial and temporal partial derivatives.


Spatial Weighting 

It is possible to increase the influence of some pixels in block $B$ by appropriate
weighting. A 2D Gaussian or triangular weighting may put higher emphasis on pixels
toward the center of a block $B$.


Composition of Warps and Computational Complexity 


The compositional approach, proposed in [Shu 00], iteratively solves for an incremental
warp $\Delta T(\mathbf{x}; \mathbf{p})$ rather than an additive update to the parameters $\Delta\mathbf{p}$ as in the
original Lucas-Kanade formulation. Then, at each iteration, we compute the composition
of incremental warps, which is equivalent to a bilinear combination of additive
update parameters [Bak 04]. The compositional formulation results in a more computationally
efficient solution since the Jacobian can be pre-computed and re-used
at each iteration.


Direct Methods for Homography Estimation 


Direct nonlinear optimization of (4.30) for homography estimation is likely to get
stuck at a local minimum without sufficiently close initial estimates [Sze 96]. The
inverse compositional formulation has been shown to provide an efficient solution
(see Appendix in [Bak 04]). Linear methods for homography estimation given
matched pixel correspondence pairs are addressed in Section 4.5.5.


4.4.2 Horn-Schunck Motion Estimation


Horn and Schunck [Hor 81] employ a non-parametric motion model to regularize 
the ill-posed motion-estimation problem. They seek a motion field that satisfies the 
OFE with the minimum pixel-to-pixel variation among the flow vectors in order 
to impose a global smoothness constraint on the velocity field. Hence, representing 
the spatial and temporal coordinates by continuous variables, motion estimation is 
posed as a variational optimization problem to minimize a global energy function 
of the form 


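$$\hat{\mathbf{v}}(\mathbf{x}) = \arg\min_{\mathbf{v}(\mathbf{x})}\int_B\left(E_{of}^2(\mathbf{v}(\mathbf{x})) + \alpha^2 E_s^2(\mathbf{v}(\mathbf{x}))\right)d\mathbf{x} \tag{4.34a}$$

where $B$ denotes continuous image support, and

$$E_{of}(\mathbf{v}(\mathbf{x})) = \left\langle\nabla s_c(\mathbf{x},t),\, \mathbf{v}(\mathbf{x})\right\rangle + \frac{\partial s_c(\mathbf{x},t)}{\partial t} \tag{4.34b}$$

is the error in the optical flow equation, which imposes data consistency. The second
term $E_s^2(\mathbf{v}(\mathbf{x}))$ is a smoothness prior, where pixel-to-pixel variation of the velocity vectors
can be quantified by the sum of magnitude squares of the spatial gradients of the
components of the velocity vector, given by

$$E_s^2(\mathbf{v}(\mathbf{x})) = \|\nabla v_1(\mathbf{x})\|^2 + \|\nabla v_2(\mathbf{x})\|^2 = \left(\frac{\partial v_1(\mathbf{x})}{\partial x_1}\right)^2 + \left(\frac{\partial v_1(\mathbf{x})}{\partial x_2}\right)^2 + \left(\frac{\partial v_2(\mathbf{x})}{\partial x_1}\right)^2 + \left(\frac{\partial v_2(\mathbf{x})}{\partial x_2}\right)^2$$

It can easily be verified that the smoother the velocity field, the smaller $E_s^2(\mathbf{v}(\mathbf{x}))$.
The parameter $\alpha$, usually selected heuristically, controls the strength of the smoothness
constraint. Larger values of $\alpha^2$ increase the influence of the constraint.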
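Minimization of the functional (4.34) is treated as a calculus of variations problem
leading to the Euler-Lagrange equations given by

$$\frac{\partial E(\mathbf{v}(\mathbf{x}))}{\partial v_1} - \frac{\partial}{\partial x_1}\frac{\partial E(\mathbf{v}(\mathbf{x}))}{\partial v_{1,x_1}} - \frac{\partial}{\partial x_2}\frac{\partial E(\mathbf{v}(\mathbf{x}))}{\partial v_{1,x_2}} = 0 \tag{4.35a}$$

$$\frac{\partial E(\mathbf{v}(\mathbf{x}))}{\partial v_2} - \frac{\partial}{\partial x_1}\frac{\partial E(\mathbf{v}(\mathbf{x}))}{\partial v_{2,x_1}} - \frac{\partial}{\partial x_2}\frac{\partial E(\mathbf{v}(\mathbf{x}))}{\partial v_{2,x_2}} = 0 \tag{4.35b}$$

where $E(\mathbf{v}(\mathbf{x}))$ denotes the integrand of (4.34a), $v_{1,x_1} = \partial v_1 / \partial x_1$, $v_{1,x_2} = \partial v_1 / \partial x_2$,
$v_{2,x_1} = \partial v_2 / \partial x_1$, and $v_{2,x_2} = \partial v_2 / \partial x_2$. These
equations, which are linear in the unknowns $v_1$ and $v_2$, can be solved by Gauss-Seidel
iterations in which the spatial partials are evaluated at $(\mathbf{x}, t)$. The complete derivation can be found in
[Hor 81]. The initial estimates of the velocities $v_1^{(0)}(\mathbf{x},t)$ and $v_2^{(0)}(\mathbf{x},t)$ are usually set
to zero, and all spatial and temporal partials are estimated from the observed images.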
Horn and Schunck [Hor 81] proposed to estimate both spatial and temporal partials 
by averaging four finite differences, where the temporal partials are estimated from 
two frames by 


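$$\frac{\partial s_c(\mathbf{x},t)}{\partial t} \approx \frac{1}{4}\Big\{ s[n_1,n_2,k+1] - s[n_1,n_2,k] + s[n_1+1,n_2,k+1] - s[n_1+1,n_2,k] + s[n_1,n_2+1,k+1] - s[n_1,n_2+1,k] + s[n_1+1,n_2+1,k+1] - s[n_1+1,n_2+1,k] \Big\} \tag{4.37}$$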


Other methods to estimate spatial-image partials, e.g., using derivatives of Gaussian
filters, have been discussed in Section 3.3.
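The averaged frame difference in (4.37) translates directly into code. In the sketch below the frames are plain lists of rows indexed as `frame[n2][n1]`; that indexing convention and the function name `temporal_partial` are assumptions of this example, and the caller is expected to stay one pixel away from the right and bottom borders.

```python
def temporal_partial(s_k, s_k1, n1, n2):
    """Temporal partial at (n1, n2): the mean of the four frame differences
    taken over the 2x2 neighbourhood starting at (n1, n2).

    s_k, s_k1: frames k and k+1 as lists of rows, indexed frame[n2][n1].
    """
    total = 0.0
    for dn2 in (0, 1):
        for dn1 in (0, 1):
            total += s_k1[n2 + dn2][n1 + dn1] - s_k[n2 + dn2][n1 + dn1]
    return total / 4.0

frame_k = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
frame_k1 = [[1, 2, 0], [3, 6, 0], [0, 0, 0]]
print(temporal_partial(frame_k, frame_k1, 0, 0))
```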

The Horn—Schunck method imposes optical-flow and smoothness constraints 
globally over the entire image, or over a selected window. This has some undesired 
effects: 


1. A global smoothness constraint blurs "motion edges." For example, if an object
moves against a stationary background, there is a sudden change in the motion
field at the boundary of the object. Motion edges can be preserved by imposing
the smoothness constraint along object boundaries but not perpendicular
to motion boundaries. This is the basic concept of the so-called directional or
oriented smoothness constraints, discussed next.

2. The optical-flow constraint has also been enforced at the occlusion regions,
where it is indeed not valid. The OFE must be enforced selectively by varying
$\alpha$ adaptively to control the relative strengths of the optical-flow and smoothness
constraints. For example, at occlusion regions, such as the dotted regions
shown in Figure 4.11, the optical-flow constraint should be turned off, while
the smoothness constraint must remain fully on.
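For reference, the basic Horn-Schunck iteration with a single global $\alpha^2$, i.e., the non-adaptive formulation whose side effects are listed above, might be sketched as follows. The update rule used here is the commonly cited form attributed to [Hor 81], in which each velocity is replaced by its local average minus a correction along the image gradient; it is an assumption of this example rather than a transcription of this section, as are the 4-neighbour average, the replicated borders, and the function name `horn_schunck`.

```python
def horn_schunck(sx1, sx2, st, alpha2=1.0, iters=100):
    """Estimate a dense velocity field from per-pixel partial derivatives.

    sx1, sx2, st: 2D lists (rows of equal length) of spatial and temporal partials.
    alpha2: smoothness weight; iters: number of in-place (Gauss-Seidel style) sweeps.
    """
    h, w = len(st), len(st[0])
    v1 = [[0.0] * w for _ in range(h)]
    v2 = [[0.0] * w for _ in range(h)]

    def avg(v, r, c):
        # 4-neighbour mean with replicated borders (an assumed boundary rule).
        up, down = max(r - 1, 0), min(r + 1, h - 1)
        left, right = max(c - 1, 0), min(c + 1, w - 1)
        return (v[up][c] + v[down][c] + v[r][left] + v[r][right]) / 4.0

    for _ in range(iters):
        for r in range(h):
            for c in range(w):
                a1, a2 = avg(v1, r, c), avg(v2, r, c)
                g1, g2, gt = sx1[r][c], sx2[r][c], st[r][c]
                # OFE residual at the averaged velocity, scaled by the regularizer.
                k = (g1 * a1 + g2 * a2 + gt) / (alpha2 + g1 * g1 + g2 * g2)
                v1[r][c] = a1 - g1 * k
                v2[r][c] = a2 - g2 * k
    return v1, v2

# A uniform horizontal gradient whose temporal partial is consistent with v1 = 1.
ones = [[1.0] * 4 for _ in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
minus = [[-1.0] * 4 for _ in range(4)]
v1, v2 = horn_schunck(ones, zeros, minus)
print(v1[0][0], v2[0][0])
```

Making `alpha2` a per-pixel quantity would be one way to experiment with the selective enforcement described in item 2.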


Adaptive Smoothness Constraints 


Several researchers proposed to impose adaptive smoothness constraints. Hildreth 
[Hil 84] minimized the criterion function of Horn and Schunck given by (4.34) 
along object contours. Nagel and Enkelman [Nag 86, Enk 88] introduced a direc- 
tional smoothness constraint, which suppresses the smoothness constraint in the 
direction of the spatial-image gradient. Fogel [Fog 91] used the directional smooth- 
ness constraint with adaptive weighting in a hierarchical formulation. Note that 
adaptive weighting methods require strategies to detect moving object (occlusion) 
boundaries. Snyder [Sny 91] proposed a general formulation of the smoothness con- 
straint that includes some of the above as special cases. The directional smoothness 
constraint can be expressed as 


$$E_{ds}^2(\mathbf{v}(\mathbf{x})) = (\nabla v_1)^T\,W\,(\nabla v_1) + (\nabla v_2)^T\,W\,(\nabla v_2) \tag{4.38}$$


where $W$ is a weight matrix to penalize variations in the motion field depending on
the spatial changes in gray-level content of the video. Various alternatives for the
weight matrix $W$ exist [Nag 86, Nag 87, Enk 88]. For example, $W$ can be chosen as

$$W = \frac{F + \delta I}{\operatorname{trace}(F + \delta I)}$$

where $I$ is the identity matrix to ensure a non-zero weight matrix at spatially uniform
regions, $\delta$ is a global scalar constant, and $F$ is a matrix constructed from the partial
derivatives of the image intensity.

Observe that the method of Horn and Schunck (4.34) is a special case of this formulation
with $\delta = 1$ and $F = 0$. A Gauss-Seidel iteration to minimize the problem
formulation using (4.44) has been described in [Enk 88], where the update term at
each iteration is computed by means of a linear algorithm. The performance of the
directional-smoothness method depends on how accurately the required second and
mixed partials of image intensity can be estimated. A hierarchical implementation of
adaptive smoothness constraints can be found in [Fog 91].


Median Filtering as an Energy Function 


Sun et al. [Sun 10] observed that median filtering of the intermediate flow results
once after every iteration, e.g., with a 5 × 5 median filter, yields significantly better
results, although this leads to higher energy solutions. Hence, they propose a new
objective function that formalizes the heuristic median filtering. This objective function
includes a non-local term that robustly integrates flow estimates over large spatial
neighborhoods. A hierarchical estimation procedure has been proposed, where
they alternate between minimizing a classical Horn-Schunck type of energy function
and a new median-filtering energy function 10 times at every level of the pyramid.


4.5 Matching Methods 


We can classify matching methods as i) block-matching, which assigns a forward
and/or backward-motion vector to blocks of pixels, where blocks may be overlapping
or non-overlapping, and ii) sparse feature-matching methods. Block-matching
motion estimation, which can be performed using fixed or variable size blocks, is
commonly employed in video-compression standards such as ISO/IEC MPEG-2,
which uses fixed-size blocks, and advanced video coding (AVC) and high-efficiency
video coding (HEVC) (ITU-T H.264/265), which allow variable size blocks. Variable
block-size motion estimation offers a tradeoff in video coding: using larger
fixed-size blocks reduces the number of bits needed to encode MVs at the expense
of an increase in the number of bits to encode the prediction residual, while using
smaller variable-size blocks can reduce the number of bits needed
to encode the prediction residual at the expense of an increase in the number of bits to
encode MVs. Hierarchical block-matching is often preferred since it increases estimation
accuracy and helps reduce search complexity. Sparse feature matching aims
to match a predetermined set of feature points between pairs of frames, which can
be later used in feature-point tracking or to determine parameters of a motion model
from some number of feature correspondences.


4.5.1 Basic Block-Matching 


The basic block-matching procedure takes a fixed size block from the present frame 
and searches for the location of the best-matching block of the same size in a (past 
and/or future) reference frame based on some distance criterion (see Figure 4.14). 
Block-matching algorithms differ in the choice of the block size, matching (distance) 
criteria, and search strategy employed. 

The matching error can be quantified according to several criteria including
minimum mean-square error (MSE), minimum mean absolute difference (MAD),
maximum cross-correlation, maximum matching pel count (MPC), and so on. The
MSE criterion is defined by

$$\mathrm{MSE}(d_1,d_2) = \frac{1}{N_1 N_2}\sum_{(n_1,n_2)\in B}\left[s[n_1,n_2,k] - s[n_1+d_1,\,n_2+d_2,\,k-1]\right]^2 \tag{4.39}$$

where $B$ is an $N_1 \times N_2$ block, and $(d_1, d_2)$ denotes a candidate MV. The MAD criterion,
defined by

$$\mathrm{MAD}(d_1,d_2) = \frac{1}{N_1 N_2}\sum_{(n_1,n_2)\in B}\left|s[n_1,n_2,k] - s[n_1+d_1,\,n_2+d_2,\,k-1]\right| \tag{4.40}$$

is the most popular choice for very large scale integration (VLSI) implementations.

[Figure 4.14: the reference frame and the current frame.]

Figure 4.14 Block-matching.

The estimated MV is the value of $(d_1, d_2)$ that minimizes (4.39) or (4.40):

$$[\hat{d}_1\ \hat{d}_2]^T = \arg\min_{(d_1,d_2)}\mathrm{MSE}(d_1,d_2)$$

or

$$[\hat{d}_1\ \hat{d}_2]^T = \arg\min_{(d_1,d_2)}\mathrm{MAD}(d_1,d_2)$$

The performance of the MAD criterion may deteriorate as the search area 
becomes larger due to presence of several local minima. 

For video coding, every frame is partitioned into fixed or variable size blocks and
one (for uni-directional) or two (for bi-directional) MVs are computed for each block.
The block sizes may vary between 8 × 8 and 64 × 64 (in the HEVC standard). For
other applications, a common approach to computing a dense-motion field using
block-matching is to estimate motion vectors on a sparse grid of pixels, e.g., once
every four pixels and four lines with partially overlapping blocks of size $N_1 = N_2 = 16$,
and then to interpolate the remaining vectors to obtain a dense motion field.


Full Search 


Finding the best-matching block requires computation of the matching criterion
for all candidate motion vectors $(d_1, d_2)$ at each pixel $(n_1, n_2)$. This procedure, called
"full search," is time-consuming. In order to reduce the computational load, we can
limit candidate motion vectors to within a $(2M + 1) \times (2M + 1)$ "search window"
(depicted in Figure 4.14) such that

$$-M \le d_1 \le M \quad \text{and} \quad -M \le d_2 \le M$$

which is centered about each pixel for which an MV will be estimated, where $M$ is a
predetermined integer. Full search is often preferred for hardware implementation
due to its predictable data flow and regular memory access.
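A full search with the MAD criterion can be sketched in a few lines for integer-valued candidates. Here the block is addressed by its top-left corner, frames are lists of rows indexed as `frame[n2][n1]`, and candidates whose block would leave the reference frame are skipped; these conventions and the names `mad` and `full_search` are assumptions of this example.

```python
def mad(cur, ref, n1, n2, d1, d2, N):
    """Mean absolute difference between the N x N block of `cur` whose top-left
    corner is (n1, n2) and the block of `ref` displaced by (d1, d2)."""
    total = 0.0
    for r in range(N):
        for c in range(N):
            total += abs(cur[n2 + r][n1 + c] - ref[n2 + r + d2][n1 + c + d1])
    return total / (N * N)

def full_search(cur, ref, n1, n2, N, M):
    """Evaluate every integer candidate in the (2M+1) x (2M+1) window and
    return (best_mv, best_error)."""
    h, w = len(ref), len(ref[0])
    best, best_mv = None, (0, 0)
    for d2 in range(-M, M + 1):
        for d1 in range(-M, M + 1):
            inside = (0 <= n1 + d1 and n1 + d1 + N <= w and
                      0 <= n2 + d2 and n2 + d2 + N <= h)
            if not inside:
                continue  # candidate block falls outside the reference frame
            err = mad(cur, ref, n1, n2, d1, d2, N)
            if best is None or err < best:
                best, best_mv = err, (d1, d2)
    return best_mv, best

# Synthetic pair: the current frame equals the reference displaced by (2, 1).
ref = [[10 * r + c for c in range(8)] for r in range(8)]
cur = [[10 * (r + 1) + (c + 2) for c in range(8)] for r in range(8)]
print(full_search(cur, ref, 2, 2, 3, 2))
```

The two nested candidate loops make the cost grow with the square of `M`, which is the motivation for the faster strategies discussed next.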


Fast Search 


In real-time software implementations, search strategies faster than the full search 
are needed, although they may lead to sub-optimal solutions. Note that in motion- 
compensated compression, we just seek a matching block, even if the match does 
not correlate well with the actual projected motion. Hence, fast-search algorithms 
serve video-compression applications reasonably well. Fast search algorithms can be 
classified as: i) those that eliminate some candidate MVs based on mathematical 
lower bounds and give the same result as that of the full search, e.g., the successive 
elimination method and its improvements; ii) those that evaluate the criterion func- 
tion only at a subset of candidate motion vectors, e.g., logarithmic search, three-step 
search, and diamond search; and iii) early-termination methods that terminate the 
search procedure, for example, by predicting zero motion vector or zero transform 
coefficients for some blocks [Yan 05]. 

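The successive elimination algorithm (SEA) [Li 95] is based on the triangle
inequality

$$\left|\sum_{(n_1,n_2)\in B}\big|s[n_1,n_2,k]\big| - \sum_{(n_1,n_2)\in B}\big|s[n_1+d_1,\,n_2+d_2,\,k-1]\big|\right| \le \sum_{(n_1,n_2)\in B}\big|s[n_1,n_2,k] - s[n_1+d_1,\,n_2+d_2,\,k-1]\big| = \mathrm{SAD}(d_1,d_2)$$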

where SAD (d,,d,) denotes the sum of absolute differences. Hence, the difference 
between the sum of intensity values and the sum of displaced intensity values, which 
can be efficiently computed using the box-filtering technique, establishes a lower 
bound for the SAD value. In the SEA, the difference on the left-hand side for each 
candidate MV is compared to the previous minimum SAD value, and the SAD 
computation is skipped if the lower bound is greater than the previous minimum. 
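
The pruning rule can be sketched as follows. This is a minimal illustration under assumed conventions (frames as lists of rows, all candidates inside the reference frame); the block sums are recomputed directly here, whereas a practical implementation would obtain them from running (box-filter) sums. All function names are hypothetical.

```python
def block_sum(frame, top, left, size):
    """Sum of absolute intensity values over a size x size block."""
    return sum(abs(frame[top + r][left + c])
               for r in range(size) for c in range(size))


def sad(cur, ref, top, left, size, d1, d2):
    """Sum of absolute differences for candidate displacement (d1, d2)."""
    return sum(abs(cur[top + r][left + c] - ref[top + d1 + r][left + d2 + c])
               for r in range(size) for c in range(size))


def sea_search(cur, ref, top, left, size, M):
    """Full search with SEA pruning.
    Returns (best_mv, best_sad, number_of_sad_evaluations)."""
    cur_sum = block_sum(cur, top, left, size)
    best_mv, best_sad, evaluated = None, None, 0
    for d1 in range(-M, M + 1):
        for d2 in range(-M, M + 1):
            # Lower bound on SAD(d1, d2) from the triangle inequality.
            bound = abs(cur_sum - block_sum(ref, top + d1, left + d2, size))
            if best_sad is not None and bound > best_sad:
                continue  # SAD >= bound > current minimum, so skip the SAD
            cost = sad(cur, ref, top, left, size, d1, d2)
            evaluated += 1
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (d1, d2), cost
    return best_mv, best_sad, evaluated
```

Because a candidate is skipped only when its lower bound already exceeds the current minimum, the returned minimum SAD is the same as that of the full search.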

Logarithmic search, proposed by Jain and Jain [Jai 81], and three-step search (TSS),
proposed by Koga et al. [Kog 81], are both multi-step search procedures that define an
initial step size and a set of search points centered at the center of the search window and
terminate when the step size reduces to one. They both have complexity O(log(M/2)),
but the logarithmic search is generally more accurate. The logarithmic search begins
with calculating SAD at the center of the search window (zero motion vector) and four
points that are ±P pixels, where typically P = M/2, from the center in the horizontal and
vertical directions, which are marked with 1 in Figure 4.15. If the minimum SAD occurs
at the center, then the step size is halved. Otherwise, the step size remains unchanged,
the next search is centered about the pixel with the minimum SAD, and SAD values
at four new points ±P pixels from the new center (marked with 2 in Figure 4.15) are
calculated. When the step size becomes 1, the nine neighbors about the current center
are searched and the pixel with the minimum SAD defines the final integer MV.
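
The logarithmic search described above can be sketched as follows. To keep the example short, the matching criterion is passed in as a hypothetical callable `cost(d1, d2)` (for instance, the SAD of a block for that candidate). The integer halving of the step and the clipping of candidates to the search window are assumptions of this sketch.

```python
def logarithmic_search(cost, M):
    """2D logarithmic search for the integer MV minimizing cost(d1, d2)
    within the window -M <= d1, d2 <= M."""
    def inside(d):
        return abs(d[0]) <= M and abs(d[1]) <= M

    c1, c2 = 0, 0                 # start at the zero motion vector
    step = max(M // 2, 1)         # typical initial step P = M/2
    while step > 1:
        candidates = [(c1, c2),
                      (c1 - step, c2), (c1 + step, c2),
                      (c1, c2 - step), (c1, c2 + step)]
        candidates = [d for d in candidates if inside(d)]
        best = min(candidates, key=lambda d: cost(*d))
        if best == (c1, c2):
            step //= 2            # minimum at the center: halve the step
        else:
            c1, c2 = best         # otherwise re-center, keep the step
    # Final stage: the nine points of the 3 x 3 neighborhood of the center.
    neighbors = [(c1 + i, c2 + j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
    neighbors = [d for d in neighbors if inside(d)]
    return min(neighbors, key=lambda d: cost(*d))
```

Since the center is listed first, ties favor the center, so the search only moves to a strictly better point and always terminates.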

The diamond search (DS), proposed by Zhu and Ma [Zhu 00], employs two
diamond-shaped search patterns: a large diamond with nine search points and a small
diamond with five points, which are shown in Figure 4.16(a). DS starts on the large
diamond centered at the center of the search window and searches for the minimum 
SAD location. If the minimum SAD does not occur at the center, the next search 
is centered at the pixel with the minimum SAD value and the search continues 
with the large diamond pattern until the minimum occurs at the center of the large 
diamond. When the minimum occurs at the center, the final search is conducted 
using the small diamond centered about this center pixel, and the location of the 
minimum SAD on the small diamond defines the final integer MV. The complexity 
of DS is O(log(M/2)) and it has better accuracy than logarithmic search and TSS. 
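
A minimal sketch of the diamond search follows, again with the matching criterion supplied as a hypothetical callable `cost(d1, d2)`. The exact offsets of the two patterns are written out as commonly drawn (nine and five points); treat them, and the clipping to the search window, as assumptions of this sketch.

```python
# Offsets of the large (nine-point) and small (five-point) diamond patterns.
LARGE_DIAMOND = [(0, 0), (-2, 0), (2, 0), (0, -2), (0, 2),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
SMALL_DIAMOND = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def diamond_search(cost, M):
    """Diamond search for the integer MV minimizing cost(d1, d2)
    within the window -M <= d1, d2 <= M."""
    def pattern(center, offsets):
        points = [(center[0] + a, center[1] + b) for a, b in offsets]
        return [d for d in points if abs(d[0]) <= M and abs(d[1]) <= M]

    center = (0, 0)
    while True:
        best = min(pattern(center, LARGE_DIAMOND), key=lambda d: cost(*d))
        if best == center:        # minimum at the center: stop moving
            break
        center = best             # otherwise re-center the large diamond
    # Final refinement with the small diamond.
    return min(pattern(center, SMALL_DIAMOND), key=lambda d: cost(*d))
```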

Recently, more sophisticated fast motion-estimation schemes have been proposed, such as
UMHexagonS [Xu 08], which include: i) initial search-point prediction (rather than starting
at the center of the search window); ii) combination of multiple search schemes,
such as cross-search, uneven multi-hexagon search, and diamond search; and iii) early
termination criteria.





Figure 4.15 Four-step logarithmic search. The best match position at the end of each step is
indicated by a square. The arrow shows the final motion vector.


Figure 4.16 Diamond search: (a) big diamond (light) and small diamond (dark) centered about
the center of the search window and (b) a four-step diamond search, where the first three steps use
big diamond and the last step uses small diamond to find an integer estimate (arrow).
The best match position at the end of each step is indicated by a square.


Sub-Pixel Search 


In most applications, motion vectors are estimated with higher than integer-pixel
precision, called sub-pixel precision. For example, most video-compression schemes
employ half-pixel or quarter-pixel precision MVs. The computational complex- 
ity of sub-pixel motion estimation is higher due to the interpolation required to 
compute in-between pixels and a larger number of candidate blocks that need 
to be evaluated. Hence, sub-pixel search is conducted only in the vicinity of the 
best integer MV, which is estimated first. We evaluate SAD at the eight half-pixel 
positions around the best integer MV (depicted in Figure 4.17) to check whether 
it can be lowered. If needed, we next evaluate SAD at the eight quarter-pixel 
positions around the best half-pixel MV to find the best quarter pixel MV. Typi- 
cally, half-pixel sample values are evaluated by using a separable six-tap FIR filter 
horizontally and vertically, while quarter-pixel sample values are computed by 
bilinear interpolation between full and computed half-pixel samples. It has been 
shown that the filters used for interpolation have an impact on the accuracy of the 
estimated MVs. 


4.5.2 Variable-Size Block-Matching 


In block-motion estimation, it is assumed that all pixels within a block move
uniformly, so that their motion can be described by a single, common motion vector.
Blocks containing motion edges pose a challenge in fixed block-size motion estimation
since multiple distinct motion vectors may be needed. A convenient approach to overcome
this problem is to subdivide such blocks so that pixels within each sub-block, which
can be as small as 4 × 4, have a single MV, resulting in so-called variable-size
block-matching (VSBM). This subdivision is usually performed using a quad-tree for
efficient representation of the block partitioning. In VSBM, smaller blocks are used in
image regions with complex motion, while larger blocks can be used where the image
is stationary or undergoes uniform motion. VSBM algorithms provide the ability to
dynamically adapt the block size to the nature of the motion field and consist of two
steps that are coupled: i) selecting the best partitioning of a square block of pixels,
and ii) finding the best motion vector for each sub-block.


Figure 4.17 Sub-pixel search locations: (1) best integer MV with the grid indicating pixels;
(2) half-pixel search positions; and (3) quarter-pixel search positions about the best half-pixel MV.

The main idea in efficient implementation of VSBM is to reuse SAD computations
as much as possible, including: i) computing SAD for the smallest sub-block size
(typically 4 × 4) and then computing SAD for larger blocks by summing the appropriate
combination of these SAD values, and ii) reusing partial SAD values for one
candidate MV for computing SAD for other candidate MVs. Several architectures
have been proposed for hardware implementation of VSBM, including FPGA, 1D 
and 2D systolic arrays, and ASIC implementations. 
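
The first kind of reuse can be sketched for a single candidate MV as follows. The example assumes an 8 × 8 block split into four 4 × 4 sub-blocks; the partition names in the returned dictionary are hypothetical labels chosen for this illustration.

```python
def sad_block(cur, ref, top, left, height, width, d1, d2):
    """SAD of a height x width block for candidate displacement (d1, d2)."""
    return sum(abs(cur[top + r][left + c] - ref[top + d1 + r][left + d2 + c])
               for r in range(height) for c in range(width))


def sads_from_4x4(cur, ref, top, left, d1, d2):
    """SADs of the partitions of an 8 x 8 block, built from four 4 x 4 SADs."""
    a = sad_block(cur, ref, top, left, 4, 4, d1, d2)          # upper-left
    b = sad_block(cur, ref, top, left + 4, 4, 4, d1, d2)      # upper-right
    c = sad_block(cur, ref, top + 4, left, 4, 4, d1, d2)      # lower-left
    d = sad_block(cur, ref, top + 4, left + 4, 4, 4, d1, d2)  # lower-right
    return {
        "sub_blocks": [a, b, c, d],
        "upper_half": a + b, "lower_half": c + d,
        "left_half": a + c, "right_half": b + d,
        "whole": a + b + c + d,
    }
```

Each pixel difference is computed once; the SADs of the larger partitions cost only a few additions.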

Among software solutions, the Test Zone Search (TZSearch) [Pur 12] algorithm 
was adopted in HEVC (the most recent video-compression standard) reference soft- 
ware as a fast ME algorithm for reducing search time with an RD performance 
comparable to that of full search. TZSearch includes two steps: determination
of the initial search point and a search procedure. The initial search point is determined
by using a set of predictors, including the median predictor (MP), left predictor (LP),
above predictor (AP), above-right predictor (ARP), and zero MV (0,0). The LP, AP,
and ARP take the MV of the left, top, and top-right block, respectively. MP takes the
median of them. After the initial search point is determined, a hybrid search, including
multiple diamond/square search and raster search, is used to locate the best
matching block with the minimum RD cost. Further improvements to TZSearch
have also been proposed [Pan 13]. 
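
The choice of the initial search point can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the median is taken component-wise, and the starting point is the predictor with the lowest cost, where `cost(d1, d2)` is a hypothetical callable standing in for the RD cost.

```python
def median_predictor(lp, ap, arp):
    """Component-wise median of the left, above, and above-right MVs."""
    return tuple(sorted(components)[1] for components in zip(lp, ap, arp))


def initial_search_point(cost, lp, ap, arp):
    """Pick the lowest-cost predictor among MP, LP, AP, ARP, and zero MV."""
    candidates = [median_predictor(lp, ap, arp), lp, ap, arp, (0, 0)]
    return min(candidates, key=lambda mv: cost(*mv))
```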


4.5.3 Hierarchical Block-Matching 


The basic principle of hierarchical-motion estimation has been discussed in Section 
4.3.5. In hierarchical block-matching, we compute Gaussian pyramid representation 
for both frames and perform search successively at different levels of the hierarchy, 
starting with the lowest resolution level. At lower resolution levels, rougher MV 
estimates are determined using relatively larger blocks, where “relative size of the 
block” is measured as the size of the block normalized by the size of the image at that 
resolution level. The estimate of MV at a lower resolution level is then passed onto 
the next higher resolution level as an initial estimate. Higher resolution levels serve 
to fine-tune the MV estimate with a relatively smaller window size and a good initial
estimate. If more than one MV yields similar SAD at the lower resolution levels, then 
they can all be refined at higher levels and the best MV is selected among them at 
the end. The increase in computational complexity is small compared to that in the 
quality of the results. 

Figure 4.18 illustrates hierarchical block-matching with two levels, where the 
search range M = 7 for Level 2 (lower resolution) and M = 3 for Level 1 (higher 
resolution) [Bie 88]. For simplicity, we assume that images at all levels of the pyramid 
are the same size but successively more blurred as we go to lower resolution levels. In 
the two-level pyramid, we simply skip over every other pixel in the low-resolution 
level (when computing matching criterion) to simulate the effect of sub-sampling. 
The best estimate at the lowest resolution level is indicated by the circled “3.” The
center of the search area in Level 1 (denoted by “0”) corresponds to the best estimate from
the second level. The estimates in the low and high levels are $[7,1]^T$ and $[3,1]^T$,
respectively, resulting in an overall estimate of $[10,2]^T$. Half-pixel and quarter-pixel search
can also be performed about the integer MV by using appropriate interpolation filters.
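
The two-level procedure can be sketched as follows, under the same simplification that both levels have the same size, so that the two estimates add directly. The callables `cost_level2` and `cost_level1` are hypothetical stand-ins for the matching criterion evaluated on the blurred, sub-sampled data and on the original data, respectively.

```python
def window_search(cost, center, M):
    """Exhaustive search of the (2M+1) x (2M+1) window around `center`."""
    candidates = [(center[0] + i, center[1] + j)
                  for i in range(-M, M + 1) for j in range(-M, M + 1)]
    return min(candidates, key=lambda d: cost(*d))


def two_level_search(cost_level2, cost_level1, M2=7, M1=3):
    """Coarse estimate at Level 2, then refinement about it at Level 1.
    Returns (coarse estimate, refinement, overall estimate)."""
    coarse = window_search(cost_level2, (0, 0), M2)
    overall = window_search(cost_level1, coarse, M1)
    refinement = (overall[0] - coarse[0], overall[1] - coarse[1])
    return coarse, refinement, overall
```

With a coarse minimum at (7, 1) and a fine minimum at (10, 2), the function returns the estimates (7, 1) and (3, 1) and their sum (10, 2). The second level visits only 7 × 7 candidates instead of the 21 × 21 that a single-level search with range 10 would need.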

We note that the SEA algorithm for faster full search has been extended to 
hierarchical-motion estimation and is called multi-level SEA or MSEA, whose com- 
putational efficiency is further improved by eliminating some redundant terms in 
the test condition [Ahn 03]. 


Figure 4.18 Example of hierarchical block-matching with two levels.


4.5.4 Generalized Block-Matching — Local 
Deformable Motion 


Spatial transformations via parametric models, such as (4.17) to (4.20), provide 
superior image registration and rendering, especially in the presence of local defor- 
mations, compared to block translation model (4.16). Such local deformable motion 
can be modeled by 2D rectangular or triangular mesh models with or without con- 
nectivity constraints. 

The concept of block-matching can be extended to estimate the parameters of 
these more sophisticated motion models. This, of course, requires higher computa- 
tional complexity; we now have to perform search in 6D (affine) or 8D (perspective) 
parameter space instead of a 2D space. Here, we present two generalized matching 
schemes: the full search, which does not impose connectivity constraints, and hex- 
agonal matching, which does. Connectivity is not valid at motion/occlusion bound- 
aries. Hence, in practice a combination of both approaches, where connectivity is 
turned off at occlusion boundaries, should be preferred. 

The full-search method, without connectivity constraints, can be summarized as: 


1. Segment the current frame into rectangular blocks as illustrated in Figure 4.19.

2. Perturb coordinates of the four corners of the co-located block in the reference 
frame starting from an initial guess to form a candidate-matching quadrilateral. 

3. For each candidate, find the model parameters that map this quadrilateral onto 
the rectangular block in the current frame using the four corners as feature 
correspondences. 

4. Perform the spatial transformation using the computed parametric model, and 
calculate the MSE between the given block and the matching quadrilateral. 

5. Choose the spatial transformation that yields the smallest MSE or MAD. 

Figure 4.19 Motion compensation without connectivity constraints on a 2D rectangular mesh.


In order to reduce the computational burden imposed by generalized block- 
matching, it is only used for those blocks where standard block-matching is not 
satisfactory. The displaced frame difference resulting from standard block-matching 
can be used as a decision criterion. 

A connectivity preserving motion-estimation method applied to 2D triangular 
meshes, called hexagonal matching, was proposed by Nakaya and Harashima [Nak 94].
The hexagonal search is based on the observation that there are six lines
intersecting at each node in a uniform triangular mesh, and the boundaries of these six
triangles define a hexagon as depicted in Figure 4.20. Assuming the motion vector
is constrained to stay inside this hexagon, each node (in the reference frame)
is perturbed one at a time to find the best matching hexagon between the current
and reference frames. The SAD for a hexagon is computed by mapping the texture
within each of the six triangles affected by a perturbed node by their respective affine
parameters. Later, Toklu et al. [Tok 96] and Altunbasak et al. [Alt 97] proposed
improvements to mesh-based motion estimation.





Figure 4.20 Motion compensation with connectivity constraint on a 2D triangular mesh. 


4.5.5 Homography Estimation from Feature Correspondences


This section presents methods for homography estimation from a number of
predetermined feature correspondences. Detection of “good” features and the (sub-pixel)
precision of correspondence estimation play important roles in the accuracy of
the parameter estimates. For each predetermined feature point n, we can express
the projective transformation (4.14) in homogeneous coordinates as


$$\mathbf{z}'_n = \mathbf{H}\,\mathbf{z}_n$$


where $\mathbf{z}_n = [z_{n1}\; z_{n2}\; z_{n3}]^T$ and $z_{n3}$ is set equal to 1. Observe that H can be multiplied by a non-zero constant
without altering the image coordinates $x'_{n1} = z'_{n1}/z'_{n3}$ and $x'_{n2} = z'_{n2}/z'_{n3}$, which is
referred to as scale ambiguity. Thus, H is a homogeneous matrix with only 8 degrees
of freedom even though it has 9 entries. Expressing $x'_{n1}$ and $x'_{n2}$ in terms of the
entries of H and multiplying both sides by $z'_{n3}$ yields equations that are linear in the
entries of H. Hence, we obtain two linearly independent equations for each feature point
correspondence, excluding the point at infinity, i.e., assuming $z'_{n3} \neq 0$.


Direct Linear Transformation (DLT) Method 


Given N ≥ 4 feature-point correspondences, such that no three points are collinear,
we can obtain 2N homogeneous linear equations in the form


$$\begin{bmatrix} \mathbf{A}_1 \\ \vdots \\ \mathbf{A}_N \end{bmatrix} \mathbf{h} = \mathbf{A}\mathbf{h} = \mathbf{0} \tag{4.41b}$$


where A is a 2N × 9 matrix and h is a 9 × 1 vector consisting of the unknown
homography parameters.

If N = 4, or N > 4 but all point correspondences are exact (noise-free), then A has
rank 8. Hence, A has a 1D null space that provides a solution for h that can only be
determined up to a non-zero scale factor. Recall that H is defined up to a scale factor
anyway. A scale factor may arbitrarily be chosen by setting ||h|| = 1, which avoids the
trivial solution h = 0. If N > 4 and the correspondences are noisy, then the
overdetermined system Ah = 0 is inconsistent and does not have a solution. We can
then find a least-squares solution for h, which minimizes ||Ah||, subject to ||h|| = 1.
In either case, h is given by the last column of V, where $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ is the singular
value decomposition (SVD) of A (see Appendix D).
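
A minimal NumPy sketch of the DLT solution follows. It assumes that h stacks the entries of H row by row and that each correspondence contributes the two rows shown in the code; this ordering is an assumption of the sketch and may differ from the arrangement used in (4.41a).

```python
import numpy as np


def dlt_homography(pts, pts_prime):
    """Estimate H (3 x 3, with ||h|| = 1) from N >= 4 correspondences.
    pts, pts_prime: sequences of (x1, x2) image coordinates."""
    rows = []
    for (x1, x2), (y1, y2) in zip(pts, pts_prime):
        # Two linear equations per correspondence (row-major h assumed).
        rows.append([x1, x2, 1, 0, 0, 0, -y1 * x1, -y1 * x2, -y1])
        rows.append([0, 0, 0, x1, x2, 1, -y2 * x1, -y2 * x2, -y2])
    A = np.asarray(rows, dtype=float)      # 2N x 9
    _, _, Vt = np.linalg.svd(A)            # A = U diag(s) Vt
    h = Vt[-1]                             # last column of V, unit norm
    return h.reshape(3, 3)
```

Since the estimate is defined only up to scale, compare it with another homography after dividing both by a common entry, for example `H / H[2, 2]`, provided that entry is not close to zero.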

An alternative derivation of the DLT method, which is based on the fact that
$\mathbf{z}'_n \times \mathbf{H}\mathbf{z}_n = \mathbf{0}$ and keeps all three equations to include the case $z'_{n3} = 0$ (a point at
infinity), when dehomogenization leading to Eqn. (4.41a) is not possible, can be
found in [Har 04].

We note that it is possible to estimate the parameters of the homography (4.14)
by first fixing the scale by setting the last entry of H equal to 1 and then solving the
resulting set of inhomogeneous equations. However, this approach gives poor results if
the actual value of that entry is close to 0, and hence is not recommended.


Normalization 


The performance of the basic DLT algorithm depends on the origin and scale of the 
coordinate system for both image frames. Hence, normalization of the pixel coor- 
dinates helps to obtain numerically stable solutions [Har 04]. The normalized DLT 
algorithm works as follows: 
1. Compute a similarity transform $\tilde{\mathbf{x}}_n = \mathbf{T}\mathbf{x}_n$ for the first frame such that the origin
of the new coordinate system is the centroid of the points $\mathbf{x}_n$ and the average
distance of the points $\tilde{\mathbf{x}}_n$ from the origin is $\sqrt{2}$.

2. Compute a similarity transform $\tilde{\mathbf{x}}'_n = \mathbf{T}'\mathbf{x}'_n$ for the second frame the same way,
independent of the first frame.

3. Apply the DLT algorithm using the point correspondences $(\tilde{\mathbf{x}}_n, \tilde{\mathbf{x}}'_n)$ to obtain
the homography $\tilde{\mathbf{H}}$.

4. The desired homography matrix is given by $\mathbf{H} = \mathbf{T}'^{-1}\tilde{\mathbf{H}}\mathbf{T}$.


The normalized DLT method, also called the normalized 8-point algorithm,
provides quite satisfactory results [Har 97].
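
The similarity transform of steps 1 and 2 can be sketched as follows, assuming it consists of a translation and an isotropic scaling only (no rotation), written as a 3 × 3 matrix acting on homogeneous coordinates. The function names are hypothetical.

```python
import math


def normalizing_transform(points):
    """3 x 3 similarity transform T that moves the centroid of `points`
    to the origin and makes their average distance from it sqrt(2)."""
    n = len(points)
    c1 = sum(p[0] for p in points) / n
    c2 = sum(p[1] for p in points) / n
    mean_dist = sum(math.hypot(p[0] - c1, p[1] - c2) for p in points) / n
    s = math.sqrt(2.0) / mean_dist
    return [[s, 0.0, -s * c1],
            [0.0, s, -s * c2],
            [0.0, 0.0, 1.0]]


def apply_similarity(T, points):
    """Apply T to inhomogeneous 2D points (the third coordinate stays 1)."""
    return [(T[0][0] * x + T[0][2], T[1][1] * y + T[1][2]) for x, y in points]
```

For the corners of a 100 × 100 square, the normalized points are (±1, ±1), whose average distance from the origin is √2.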


Other Cost Functions 


The solution of the overdetermined linear homogeneous equations (4.41b) is an opti- 
mization problem. The DLT method minimizes the so-called algebraic error ||Ah||, 
subject to ||h|| = 1. Other cost functions that have been considered include the geometric
error (uni-directional or bi-directional transfer error), the reprojection error, and
the Sampson error [Har 04]. However, minimization of these alternate cost functions
results in more complex iterative estimation methods and will not be considered here.


Other Models 


Parameters of bi-linear or quadratic models can similarly be estimated from at least 
four point correspondences. An affine model can be estimated from at least three 
point correspondences. A comprehensive overview of parametric-motion estimation 
methods can be found in [Sze 06]. 


4.6 Nonlinear Optimization Methods 


This section discusses methods which employ nonlinear optimization schemes for 
2D-motion estimation. Pel-recursive motion estimation that uses gradient descent 
minimization is presented in Section 4.6.1. Bayesian motion estimation that uses 
probabilistic smoothness priors and nonlinear optimization to compute the maxi- 
mum a posteriori (MAP) estimate of motion field is introduced in Section 4.6.2. 


4.6.1 Pel-Recursive Motion Estimation 


Pel-recursive methods are predictor-corrector type estimators computed sequen- 
tially at each pixel. Pel-recursive motion estimation is usually preceded by a 
change-detection stage, where the frame difference at each pixel is tested against a
threshold. Estimation is performed only at those pixels belonging to the changed 
region. 

An early pel-recursive approach is the Netravali-Robbins algorithm [Rob 83],
which minimizes the square of the DFD at each pixel, given by


$$E(\mathbf{x},\mathbf{d}) = [\mathrm{DFD}(\mathbf{x},\mathbf{d})]^2 \tag{4.42}$$


where DFD(·) denotes the displaced frame difference. Minimization of E(x, d) with
respect to d, at pixel x, by the steepest descent method yields the iteration


$$\mathbf{d}^{i+1}(\mathbf{x}) = \mathbf{d}^{i}(\mathbf{x}) - \varepsilon\, \mathrm{DFD}(\mathbf{x},\mathbf{d}^{i})\, \nabla_{\mathbf{x}} s(\mathbf{x}-\mathbf{d}^{i}; t-\Delta t) \tag{4.43}$$


where $\nabla_{\mathbf{x}}$ is the gradient with respect to x, and $\varepsilon$ is the step size. The initial estimate
$\mathbf{d}^{0}(\mathbf{x})$ can be taken as the MV at the previous pixel or as a linear combination of
previously computed MVs in a neighborhood of the current pixel. Note that the
negative of the gradient points in the direction of the steepest descent. In (4.43), the
first and second terms are prediction and update terms, respectively. The aperture
problem is also apparent in the pel-recursive algorithms. Since the update term is a
vector along the spatial gradient of image intensity, no correction can be performed
in the direction perpendicular to the gradient vector.
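
The steepest-descent iteration and the aperture problem can be illustrated with a small sketch. It assumes the convention DFD(x, d) = s(x; t) − s(x − d; t − Δt), and takes the two frames and the gradient of the previous frame as hypothetical callables so that no interpolation code is needed.

```python
def netravali_robbins(s_cur, s_prev, grad_prev, x, d0, eps, n_iter):
    """Steepest-descent displacement estimate at pixel x = (x1, x2).
    s_cur, s_prev: callables returning the intensity at (x1, x2);
    grad_prev: callable returning the spatial gradient of s_prev."""
    d1, d2 = d0
    for _ in range(n_iter):
        p = (x[0] - d1, x[1] - d2)          # displaced position x - d
        dfd = s_cur(*x) - s_prev(*p)        # displaced frame difference
        g1, g2 = grad_prev(*p)              # gradient of the previous frame
        d1 -= eps * dfd * g1                # update along the gradient only
        d2 -= eps * dfd * g2
    return d1, d2
```

For an intensity pattern that varies only along x1, such as s(x1, x2) = 2 x1, the gradient has no x2 component. The iteration then recovers the x1 component of the displacement but never corrects the x2 component, which is the aperture problem described above.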

The rate of convergence of the Netravali-Robbins algorithm depends on the
choice of the step size parameter $\varepsilon$. For example, if $\varepsilon = 1/16$, then at least 32 iterations
are required to estimate a displacement by two pixels. On the other hand, a
choice of a large step size may cause oscillatory behavior. In order to facilitate faster
convergence, an adaptive step size,


$$\varepsilon = \frac{1}{\|\nabla_{\mathbf{x}} s(\mathbf{x}-\mathbf{d}^{i}; t-\Delta t)\|^2 + c^2}$$


has been proposed with a bias term $c^2$ to avoid division by zero in areas of constant
intensity where the spatial gradient is almost zero. In addition, Walker and Rao [Wal
84] have introduced the heuristic rules: i) If the DFD is less than a threshold, the
update term is set equal to zero. ii) If the DFD exceeds a threshold, but the magnitude
of the spatial-image gradient is zero, then the update term is again set equal to
zero. iii) If the absolute value of the update term (for each MV component) is less
than 1/16, then it is set equal to ±1/16. iv) If the absolute value of the update term
(for each component) is more than 2, then it is set equal to ±2. Experimental results
indicate that using an adaptive step size improves the convergence of the algorithm.
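
The adaptive step size and the four heuristic rules can be sketched as follows for one MV component. The function names and the threshold argument are hypothetical, and giving a zero update the positive sign in rule iii is an assumption of this sketch.

```python
import math


def adaptive_step(grad, c):
    """Step size 1 / (||grad||^2 + c^2) for a 2D spatial gradient."""
    g1, g2 = grad
    return 1.0 / (g1 * g1 + g2 * g2 + c * c)


def constrain_update(update, dfd, grad_mag, dfd_threshold):
    """Apply the heuristic rules to one component of the update term."""
    if abs(dfd) < dfd_threshold:        # rule i: DFD below the threshold
        return 0.0
    if grad_mag == 0.0:                 # rule ii: zero spatial gradient
        return 0.0
    # Rules iii and iv: clamp the magnitude to [1/16, 2], keep the sign.
    magnitude = min(max(abs(update), 1.0 / 16), 2.0)
    return math.copysign(magnitude, update)
```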

An alternative method, called Wiener-based motion estimation [Bie 87], com- 
putes the least-squares estimate of the update term, given the previous motion vector. 


4.6.2 Bayesian Motion Estimation 


Bayesian methods utilize probabilistic data consistency and motion smoothness 
constraints, in the form of a probability distribution function (pdf), in order to 
regularize the motion-estimation problem. Probabilistic smoothness priors can be 
equivalently expressed in the form of a Markov—Gibbs random field, which models 
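local interactions between motion vectors (see Appendix B). The objective of Bayesian
motion estimation is to maximize the a posteriori pdf $p(\mathbf{d}_1,\mathbf{d}_2 \mid \mathbf{s}_k,\mathbf{s}_{k-1})$ of the
motion field, where $\mathbf{d}_1$, $\mathbf{d}_2$ denote the lexicographic ordering of the components of
the MV at each pixel, given the observable data, i.e., a pair of image frames $\mathbf{s}_k$ and
$\mathbf{s}_{k-1}$. According to the Bayes rule


$$p(\mathbf{d}_1,\mathbf{d}_2 \mid \mathbf{s}_k,\mathbf{s}_{k-1}) = \frac{p(\mathbf{s}_k \mid \mathbf{d}_1,\mathbf{d}_2,\mathbf{s}_{k-1})\; p(\mathbf{d}_1,\mathbf{d}_2 \mid \mathbf{s}_{k-1})}{p(\mathbf{s}_k \mid \mathbf{s}_{k-1})}$$


where the denominator $p(\mathbf{s}_k \mid \mathbf{s}_{k-1})$ is independent of $(\mathbf{d}_1,\mathbf{d}_2)$, hence it is a scaling
constant, and we assume that the a priori pdf does not depend on $\mathbf{s}_{k-1}$, i.e.,
$p(\mathbf{d}_1,\mathbf{d}_2 \mid \mathbf{s}_{k-1}) = p(\mathbf{d}_1,\mathbf{d}_2)$.


Basics of Bayesian Motion Estimation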


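The conditional pdf $p(\mathbf{s}_k \mid \mathbf{d}_1,\mathbf{d}_2,\mathbf{s}_{k-1})$ describes the distribution of the current frame
$\mathbf{s}_k$ given the reference frame $\mathbf{s}_{k-1}$ and motion vectors. Hence, it models the probability
distribution of the displaced frame difference (DFD) (4.26), which is assumed to
be a zero-mean Gaussian given by


$$p(\mathbf{s}_k \mid \mathbf{d}_1,\mathbf{d}_2,\mathbf{s}_{k-1}) \propto \exp\left\{ -\frac{1}{2\sigma^2} \sum_{\mathbf{x}} \left[ s_k(\mathbf{x}) - s_{k-1}(\mathbf{x}+\mathbf{d}(\mathbf{x})) \right]^2 \right\}$$


where $\sigma^2$ is the variance of the DFD, or, alternatively, the pdf of the error in the
optical flow equation (4.23).

The a priori pdf $p(\mathbf{d}_1,\mathbf{d}_2)$ of the motion vector field is defined by a Gibbs distribution
[Gem 84],


$$p(\mathbf{d}_1,\mathbf{d}_2) \propto \exp\left\{ -\frac{1}{T} \sum_{(\mathbf{x}_i,\mathbf{x}_j)\in C} \left\| \mathbf{d}(\mathbf{x}_i) - \mathbf{d}(\mathbf{x}_j) \right\|^2 \right\}$$


which encourages a smooth motion field, where C defines the set of two-pixel
cliques within a neighborhood system (see Appendix B) and T is temperature
(Appendix C). Since the logarithm is a monotonic function, we can maximize
$\ln\{p(\mathbf{s}_k \mid \mathbf{d}_1,\mathbf{d}_2,\mathbf{s}_{k-1})\, p(\mathbf{d}_1,\mathbf{d}_2 \mid \mathbf{s}_{k-1})\}$ or equivalently minimize its negative to find the
MAP motion estimates. Then, the MAP estimates $\hat{\mathbf{d}}_1,\hat{\mathbf{d}}_2$ minimize


$$E(\mathbf{d}_1,\mathbf{d}_2) = \frac{1}{2\sigma^2} \sum_{\mathbf{x}} \left[ s_k(\mathbf{x}) - s_{k-1}(\mathbf{x}+\mathbf{d}(\mathbf{x})) \right]^2 + \frac{1}{T} \sum_{(\mathbf{x}_i,\mathbf{x}_j)\in C} \left\| \mathbf{d}(\mathbf{x}_i) - \mathbf{d}(\mathbf{x}_j) \right\|^2 \tag{4.44}$$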


and the Bayesian motion-estimation problem reduces to an energy minimization 
problem. Observe that the energy function (4.44) is similar to (4.34), which is used 
in the Horn—Schunck method, in that both contain a data consistency term and a 
smoothness term. The main difference is that this is a nonlinear optimization prob- 
lem with possibly many local minima. Simulated annealing, greedy methods, or fast 
primal-dual (Fast-PD) optimization [Glo 08] have been used to minimize (4.44). 
Hierarchical Bayesian motion-estimation formulations have been shown to yield a 
more regular energy function with fewer local minima. 


Bayesian Motion Estimation with Discontinuity Modeling 


The basic Bayesian motion-estimation formulation imposes data consistency and 
global smoothness constraints across the entire image, hence failing to deal with 
motion boundaries and occlusion areas properly. The line field has been introduced 
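into the Bayesian framework to model motion boundaries [Kon 92]. The maximum
a posteriori probability (MAP) estimate of the MV field $(\mathbf{d}_1,\mathbf{d}_2)$ and the line field $\mathbf{l}$
is defined by


$$(\hat{\mathbf{d}}_1,\hat{\mathbf{d}}_2,\hat{\mathbf{l}}) = \arg\max_{\mathbf{d}_1,\mathbf{d}_2,\mathbf{l}}\; p(\mathbf{d}_1,\mathbf{d}_2,\mathbf{l} \mid \mathbf{s}_k,\mathbf{s}_{k-1}) \tag{4.45}$$


where $\mathbf{l}$ is a binary line (segmentation) field that models discontinuities in the MV
field, where $l_{ij} = 1$ indicates a motion border is present between sites $\mathbf{x}_i$ and $\mathbf{x}_j$. Hence,
we now have three unknowns per pixel, $d_1$, $d_2$, and $l$. Using the Bayes theorem,


$$p(\mathbf{d}_1,\mathbf{d}_2,\mathbf{l} \mid \mathbf{s}_k,\mathbf{s}_{k-1}) = \frac{p(\mathbf{s}_k \mid \mathbf{d}_1,\mathbf{d}_2,\mathbf{l},\mathbf{s}_{k-1})\; p(\mathbf{d}_1,\mathbf{d}_2 \mid \mathbf{l},\mathbf{s}_{k-1})\; p(\mathbf{l} \mid \mathbf{s}_{k-1})}{p(\mathbf{s}_k \mid \mathbf{s}_{k-1})}$$


where $p(\mathbf{s}_k \mid \mathbf{d}_1,\mathbf{d}_2,\mathbf{l},\mathbf{s}_{k-1})$, $p(\mathbf{d}_1,\mathbf{d}_2 \mid \mathbf{l},\mathbf{s}_{k-1})$, and $p(\mathbf{l} \mid \mathbf{s}_{k-1})$ are modeled by Gibbs
distributions with potentials $U(\mathbf{s}_k \mid \mathbf{d}_1,\mathbf{d}_2,\mathbf{l},\mathbf{s}_{k-1})$, $U(\mathbf{d}_1,\mathbf{d}_2 \mid \mathbf{l},\mathbf{s}_{k-1})$, and $U(\mathbf{l} \mid \mathbf{s}_{k-1})$,
respectively. The first potential is a likelihood function derived from a Gaussian pdf,
while the latter two are Gibbs priors that can be expressed as the sum of clique
potentials to impose smoothness constraints.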

The MAP estimate of the motion field and associated line field can be computed 
by minimizing a cost function that consists of an optical flow error (first term) and a 
Gibbs potential (second and third terms) (penalizing discontinuities in the estimated 
motion field) [Kon 92] as 


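$$(\hat{\mathbf{d}}_1,\hat{\mathbf{d}}_2,\hat{\mathbf{l}}) = \arg\min_{\mathbf{d}_1,\mathbf{d}_2,\mathbf{l}} \left\{ U(\mathbf{s}_k \mid \mathbf{d}_1,\mathbf{d}_2,\mathbf{l},\mathbf{s}_{k-1}) + \lambda_d\, U(\mathbf{d}_1,\mathbf{d}_2 \mid \mathbf{l},\mathbf{s}_{k-1}) + \lambda_l\, U(\mathbf{l} \mid \mathbf{s}_{k-1}) \right\}$$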


This energy minimization problem can be solved by a simulated annealing proce- 
dure or a deterministic approximation of it. The main drawback of Bayesian methods 
is that simulated annealing procedures require an extensive amount of computation, 
whereas deterministic procedures may be caught in a local minimum of the cost 
function. An occlusion field has also been added to the formulation [Dub 93]. It 
has been shown that extensions of the Horn—Schunck solution using directional 
smoothness constraints can be seen as a special case (using deterministic constraints) 
of this Bayesian formulation. 


4.7 Transform-Domain Methods 


Transform-domain methods can be classified as phase-correlation methods [Kug 75, 
For 02] and space-frequency spectral methods [Hee 88, Bar 94]. 


4.7.1 Phase-Correlation Method 


The phase-correlation method, first proposed by [Kug 75], exploits the fact that
translation in the image domain corresponds to a linear phase-shift in the 2D spatial
frequency domain, since $s_{k+1}(\mathbf{x}) = s_k(\mathbf{x}+\mathbf{d})$ implies


$$S_{k+1}(f_1,f_2) = S_k(f_1,f_2)\, e^{j2\pi(d_1 f_1 + d_2 f_2)} \tag{4.46}$$


The linear term of the Fourier phase difference between two frames determines
the motion estimate. The phase-correlation function is given by


$$C_{k,k+1}(f_1,f_2) = \frac{S_{k+1}(f_1,f_2)\, S_k^*(f_1,f_2)}{\left| S_{k+1}(f_1,f_2)\, S_k^*(f_1,f_2) \right|} \tag{4.47}$$


where $S_k(f_1,f_2)$ is the 2D Fourier transform of the frame k with respect to the spatial
variables $x_1$ and $x_2$, and * denotes complex conjugation. The method is insensitive to
frame-to-frame intensity shifts (bias or multiplication by a constant), since they do
not affect the Fourier phase.

If there is a translational motion between frames k and k + 1, then
$C_{k,k+1}(f_1,f_2) = e^{j2\pi(d_1 f_1 + d_2 f_2)}$ and the inverse 2D Fourier transform of (4.47) yields


$$c_{k,k+1}[n_1,n_2] = \delta[n_1 + d_1, n_2 + d_2] \tag{4.48}$$


which is an impulse whose location indicates the displacement vector $(d_1,d_2)$.


Maximum-Displacement Estimate/Block Size 


Since the DFT is periodic by the block size $N_1 \times N_2$, and the DFT of real images
exhibits Hermitian symmetry, the maximum range of displacement estimates is limited
to $[-(N_i/2) + 1, N_i/2]$ for $N_i$ even. For example, to estimate displacements
within a range [-31,32], the block size should be at least 64 × 64.
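
The phase-correlation steps, including the mapping of the peak index into the signed range above, can be sketched in plain Python with a naive DFT (adequate for small blocks only). The sign convention is an assumption of this sketch: the second block is taken to be the first block cyclically shifted by (d1, d2), so that the peak appears at (d1, d2); with the opposite convention, the peak appears at the negated location.

```python
import cmath


def dft2(block, inverse=False):
    """Naive 2D DFT (or inverse DFT) of a small block given as a list of rows."""
    n1, n2 = len(block), len(block[0])
    sign = 1 if inverse else -1
    scale = 1.0 / (n1 * n2) if inverse else 1.0
    out = []
    for k1 in range(n1):
        row = []
        for k2 in range(n2):
            acc = 0j
            for m1 in range(n1):
                for m2 in range(n2):
                    angle = 2 * cmath.pi * (k1 * m1 / n1 + k2 * m2 / n2)
                    acc += block[m1][m2] * cmath.exp(sign * 1j * angle)
            row.append(acc * scale)
        out.append(row)
    return out


def phase_correlation(prev, cur):
    """Return the signed cyclic shift (d1, d2) that maps `prev` onto `cur`."""
    n1, n2 = len(prev), len(prev[0])
    S_prev, S_cur = dft2(prev), dft2(cur)
    C = [[0j] * n2 for _ in range(n1)]
    for k1 in range(n1):
        for k2 in range(n2):
            cross = S_cur[k1][k2] * S_prev[k1][k2].conjugate()
            if abs(cross) > 1e-9:            # keep the phase only
                C[k1][k2] = cross / abs(cross)
    c = dft2(C, inverse=True)
    p1, p2 = max(((i, j) for i in range(n1) for j in range(n2)),
                 key=lambda ij: c[ij[0]][ij[1]].real)
    # Wrap peak indices into the signed range [-(N/2) + 1, N/2].
    d1 = p1 if p1 <= n1 // 2 else p1 - n1
    d2 = p2 if p2 <= n2 // 2 else p2 - n2
    return d1, d2
```

Scaling the second block by a constant leaves the result unchanged, because only the phase of the cross spectrum is retained. In practice an FFT routine would replace `dft2`.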


Boundary Effects 


In order to obtain a perfect impulse in the inverse 2D-DFT, the shift must be cyclic 
for each block. Since things disappearing at one end of a block do not generally reap- 
pear at the other end, the impulse will degenerate into a peak due to windowing with 
a rectangular kernel. 


Multiple Motions 


Experiments indicate that multiple peaks are observed if there are multiple motions. 
An additional search is required to find which peak belongs to which part of 
the block. 


4.7.2 Space-Frequency Spectral Methods


Space-frequency spectral methods exploit the fact that the Fourier power spectrum of a
translating image lies on a plane through the origin of the 3D spatio-temporal fre- 
quency domain. Thus, energy-based methods compute translational velocity by com- 
paring the output energy of a set of velocity-tuned filters [Hee 88, Fle 90, Bar 94]. 


4.8 3D Motion and Structure Estimation 


3D-motion/pose and structure (shape) estimation methods can be broadly classified 
as structure from motion (SFM) and structure from stereo (SFS) methods. SFM 
methods can be further classified as sparse structure from a set of feature correspon- 
dences and direct methods that estimate dense structure from intensity gradients 
(without explicit correspondence estimation). SFS (or from multiple views) is usu- 
ally preferred for dense structure (or depth map) estimation if stereo or multi-view 
video is available. These methods are applicable to both uncalibrated (for projective
or affine reconstruction) and calibrated (for Euclidean reconstruction) cameras.

Early works on SFM and SFS considered Euclidean reconstruction using cali- 
brated cameras only [Adi 85, Agg 88]. Given two calibrated cameras, their relative 
orientations can be determined from the epipolar constraint represented algebra- 
ically by the “essential matrix” E, which depends on the rotation R and translation t 
between the two cameras. Once the essential matrix is estimated from at least eight 
image-point correspondences, the 3D scene structure can be recovered (see (4.52)) 
relative to the coordinate frame of a reference camera (reference frame). 

Koenderink and Van Doorn [Koe 90] were first to propose solving the SFM prob- 
lem (for orthographic cameras) in two phases: i) first, reconstruct a unique (up to an 
arbitrary affine transformation) 3D scene representation, called the affine structure, 
from at least two views without calibration; ii) then, use available metric measure- 
ments (distances or angles) to uniquely determine the Euclidean structure. Faugeras 
[Fau 95] extended this approach to projective reconstruction, which is unique up to 
a projective transformation, from uncalibrated cameras, and proposed the stratifica- 
tion of the 3D reconstruction methods into projective, affine, and Euclidean stages, 
where projective or affine reconstructions may suffice for some robot vision and 
synthetic view synthesis applications without going through a laborious calibration 
process needed for Euclidean reconstruction. When calibration is not available, the 
epipolar geometry is represented by the “fundamental matrix,” which also incorpo- 
rates unknown camera calibration information. Projective and affine reconstructions 
compute 3D coordinates of a sparse set of points relative to reference frames defined
by five and four selected points in projective and affine spaces, respectively. In pro- 
jective and affine reconstructions, the unknown camera calibration matrices, where 
intrinsic and extrinsic camera parameters are allowed to vary from frame to frame, 
are folded into the overall projective or affine deformation (ambiguity) of the solu- 
tion with respect to the Euclidean solution. 

Literature on 3D motion and structure estimation is extensive; there are many 
papers, book chapters [Zha 04], and complete books [Har 04] on the subject. Our 
coverage here will be at the beginner level to introduce the main concepts and popu- 
lar methods. We start with the basics of camera calibration in Section 4.8.1. Sparse 
affine reconstruction using an uncalibrated camera is covered in Section 4.8.2. Sec- 
tion 4.8.3 treats uncalibrated sparse projective reconstruction. Euclidean reconstruc- 
tion using full or partial calibration is discussed in Section 4.8.4. We introduce a 
direct method for dense planar parallax estimation in Section 4.8.5. Finally, Section 
4.8.6 covers dense structure estimation from stereo (multi-view) images/video. 


4.8.1 Camera Calibration 


Camera calibration is an essential step to estimate metric (Euclidean) structure from 
video. It refers to estimation of intrinsic and extrinsic camera parameters (a total of 
11 free parameters) in the camera model (4.3). The intrinsic camera matrix K has 
5 degrees of freedom, where $(x_{1,0}, x_{2,0})$ denotes the center of the image, s is the skew 
parameter, f is the focal length of the camera, and $k_1/k_2$ denotes the aspect ratio. The 
extrinsic camera parameters are the rotation matrix R with three degrees of freedom 
and the translation vector t with three parameters, which model the rotation and 
translation of the camera coordinate system with respect to the scene (world) coor- 
dinate system, respectively. 

Camera-calibration techniques can be classified as pre-calibration methods 
using known reference objects and auto-calibration (or self-calibration) methods 
that only rely on point correspondences from an actual video scene without any ref- 
erence objects. Pre-calibration methods can utilize multiple images of a known 3D 
reference object [Tsa 87] or a planar target [Zha 00] from different viewpoints. Pre- 
calibration techniques can be classified as direct estimation of parameters vs. two- 
step estimation: first estimation of the camera projection matrix P = K[R|t] and 
then recovery of the intrinsic and extrinsic parameters from entries of the projection 
matrix. Pre-calibration should be preferred whenever possible, since self-calibration 
cannot always achieve the same level of accuracy as that of pre-calibration [Zha 04]. 
In some cases, it is only possible to perform partial pre-calibration to determine 
the camera center, pixel aspect ratio, and skewness parameter, since the focal length 
and camera rotation and translation may vary during the actual recording (capture). 
A popular method for pre-calibration using a 3D reference object consists of four 
steps: 


1. Detect image feature points $\mathbf{x}_i$ corresponding to 3D reference object points with 
known world coordinates $\mathbf{X}_i$; 

2. Estimate the camera projection matrix P from at least six world-image point 
correspondence pairs by solving x = PX using the method of linear least squares 
(Appendix D); 

3. Compute the intrinsic and extrinsic camera parameters K, R and t as closed- 
form functions of entries of matrix P [Zha 04]; 

4. Refine K, R, and t through nonlinear optimization of 

$$\min_{\mathbf{K},\mathbf{R},\mathbf{t}} \sum_i \left\| \mathbf{x}_i - \mathbf{P}\mathbf{X}_i \right\|^2$$


starting with the estimates in step 3. Alternatively, we can first refine P through 
nonlinear optimization (i.e., complete step 4 right after step 2) and then deter- 
mine K, R, and t from refined P. 
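
Steps 2 and 3 above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure of [Zha 04]: the function names are hypothetical, the linear least-squares problem is solved through the SVD, and K, R, and t are extracted with a QR-based decomposition of the left 3 × 3 block of P instead of closed-form expressions. The nonlinear refinement of step 4 is omitted.

```python
import numpy as np

def estimate_projection_matrix(X_world, x_img):
    """Linear least-squares (DLT) estimate of the 3x4 matrix P from N >= 6 pairs.

    X_world: (N, 3) world coordinates; x_img: (N, 2) pixel coordinates.
    P is recovered only up to a nonzero scale factor.
    """
    N = X_world.shape[0]
    Xh = np.hstack([X_world, np.ones((N, 1))])  # homogeneous world points
    A = np.zeros((2 * N, 12))
    for i in range(N):
        u, v = x_img[i]
        A[2 * i, 0:4] = Xh[i]           # p1.X - u * p3.X = 0
        A[2 * i, 8:12] = -u * Xh[i]
        A[2 * i + 1, 4:8] = Xh[i]       # p2.X - v * p3.X = 0
        A[2 * i + 1, 8:12] = -v * Xh[i]
    # Minimizer of ||A p|| subject to ||p|| = 1: last right singular vector.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)

def decompose_projection_matrix(P):
    """Split P ~ K [R | t] into K (upper triangular, K[2,2] = 1), R, and t."""
    M = P[:, :3]
    # M = K R implies inv(M) = R^T inv(K) = (orthogonal)(upper triangular).
    Q, U = np.linalg.qr(np.linalg.inv(M))
    K, R = np.linalg.inv(U), Q.T
    S = np.diag(np.sign(np.diag(K)))    # make the diagonal of K positive
    K, R = K @ S, S @ R
    t = np.linalg.solve(K, P[:, 3])     # uses K at the same scale as P
    K = K / K[2, 2]
    if np.linalg.det(R) < 0:            # P was estimated with a negative scale
        R, t = -R, -t
    return K, R, t
```

With noise-free correspondences the sketch recovers the parameters used to generate the data; with real measurements the result would serve only as the starting point for step 4.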


A similar procedure to recover intrinsic and extrinsic camera parameters from a 
homography estimated by using eight world-image point-correspondence pairs has 
been proposed by Zhang [Zha 00, Zha 04] when a simpler planar (2D) test pattern 
is used. 


4.8.2 Affine Reconstruction 


Recall from Section 4.1.1 that the affine camera model (4.7) covers orthographic, 
weak-perspective, and paraperspective projection models and provides a reasonable 
approximation for imaging of distant or limited depth scenes. The affine structure 
reconstruction problem can be defined as [Koe 90]: Given at least five point corre- 
spondences between two views captured by an uncalibrated affine camera, defined by 
(4.7), arbitrarily choose four of these points that are not in a degenerate configuration, 
and reconstruct the fifth point (and any other points with known correspondences) 
in the affine coordinate system defined by these four points. The affine structure dif- 
fers from the Euclidean structure by an unknown affine transformation, which alters 
metric distances and angles, but preserves parallelism. This transformation can be 
computed in the second step of stratification using camera-calibration parameters or 
some metric measurements about the scene to recover the Euclidean structure up to 
a scale factor. Koenderink-Van Doorn [Koe 90] offered a geometric solution to the 
two-view affine structure reconstruction problem. 


Multi-View Affine Reconstruction — Factorization Method 


Often we have more than two views of a scene and would like to estimate all camera 
matrices and affine structure of features at once. Tomasi and Kanade [Tom 92] proposed 
an elegant algebraic solution to this multi-view problem. Let us assume we have 
correspondence information between N feature points over M views denoted by $\mathbf{x}_{ij}$, 
i = 1,...,M, j = 1,...,N. They made the observation that in the case of an affine 
camera and rigid motion, the 2M × N measurement matrix W, which stacks the 2D image 
coordinates of all corresponding points in successive views, has rank 3. This reduced 
rank of the measurement matrix W comes from the fact that the positions of feature 
points in the image plane are constrained by the rigid 3D motion. Furthermore, given 
Eqn. (4.7), it can be expressed as the product of the camera-motion matrix $\mathbf{M}$ and the 
affine-structure matrix $\mathbf{X}$ as 


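$$\mathbf{W} = \begin{bmatrix} \mathbf{x}_{11} & \mathbf{x}_{12} & \cdots & \mathbf{x}_{1N} \\ \mathbf{x}_{21} & \mathbf{x}_{22} & \cdots & \mathbf{x}_{2N} \\ \vdots & \vdots & & \vdots \\ \mathbf{x}_{M1} & \mathbf{x}_{M2} & \cdots & \mathbf{x}_{MN} \end{bmatrix}_{2M \times N} = \begin{bmatrix} \mathbf{M}_1 \\ \mathbf{M}_2 \\ \vdots \\ \mathbf{M}_M \end{bmatrix}_{2M \times 3} \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 & \cdots & \mathbf{X}_N \end{bmatrix}_{3 \times N} = \mathbf{M}\mathbf{X} \tag{4.49}$$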


where each camera matrix $\mathbf{M}_i$ is 2 × 3, $\mathbf{x}_{ij}$ is 2 × 1, and $\mathbf{X}_j$ is 3 × 1. Tomasi and 
Kanade [Tom 92] proposed to factorize the measurement matrix W using the 
singular-value decomposition (SVD) 


$$\mathbf{W} = \mathbf{U}\mathbf{D}\mathbf{V}^T \tag{4.50}$$


where $\mathbf{M} = \mathbf{U}$ corresponds to motion (camera) matrices and $\mathbf{X} = \mathbf{D}\mathbf{V}^T$ corresponds 
to a matrix formed by affine-structure vectors relative to a coordinate system 
centered at one of these points. Since W has rank 3, only the three largest singular 
values and the corresponding columns of U and V are retained. Hence, the multi-view 
affine-reconstruction method can be summarized as: i) form the measurement matrix W, 
and ii) perform the SVD of W to recover M and X. 
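
The two steps can be illustrated with a short NumPy sketch. It is an illustration under assumptions rather than the exact method of [Tom 92]: the function name is hypothetical, and the image coordinates are centered at the centroid of the feature points in each view (instead of at one selected point) so that the centered measurement matrix has rank 3.

```python
import numpy as np

def affine_factorization(W):
    """Rank-3 factorization of a 2M x N measurement matrix W.

    Returns M_hat (2M x 3), X_hat (3 x N), and the per-row centroid (2M x 1),
    so that M_hat @ X_hat + centroid approximates W.
    """
    centroid = W.mean(axis=1, keepdims=True)   # assumed choice of origin
    Wc = W - centroid
    U, d, Vt = np.linalg.svd(Wc, full_matrices=False)
    M_hat = U[:, :3]                           # motion (camera) matrices
    X_hat = np.diag(d[:3]) @ Vt[:3]            # affine structure
    return M_hat, X_hat, centroid
```

The recovered factors are defined only up to an invertible 3 × 3 matrix A, since (M A)(A⁻¹ X) gives the same product; this is the affine ambiguity discussed above.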


4.8.3 Projective Reconstruction 


When 3D scene reconstruction is based on views recorded by uncalibrated perspec- 
tive camera(s), the resulting scene structure, called projective reconstruction, differs 
from the Euclidean geometry by an unknown projective transformation. This is due 
to the ambiguity stated by (4.7) that any pair of camera matrices $\mathbf{P}_i\mathbf{H}$ and 
structure matrices $\mathbf{H}^{-1}\mathbf{X}$ yield the same projected image coordinates and are projectively 
equivalent. This unknown projective transformation can be recovered in the second 
stage (see Section 4.8.4) from camera-calibration information up to a scale factor. 
Here, we discuss projective reconstruction from two views and multiple views. 


Two-View Projective Reconstruction — Epipolar Geometry 


3D reconstruction of a point $\mathbf{X}_j$ given its image-plane coordinates $\mathbf{x}_{1j}$ and $\mathbf{x}_{2j}$ on two 
views is based on epipolar geometry, which states that the line joining $\mathbf{x}_{1j}$ with its camera 
center $O_1$ and the line joining $\mathbf{x}_{2j}$ with its camera center $O_2$ intersect at the point $\mathbf{X}_j$, or 
alternatively that the line from $O_1$ to $\mathbf{x}_{1j}$, the line from $O_2$ to $\mathbf{x}_{2j}$, and the line from 
$O_1$ to $O_2$ are all co-planar, as depicted in Figure 4.21. 

The fundamental matrix F captures this epipolar geometry between two views, 
which only depends on camera parameters and pose, in an algebraic expression in 
the homogeneous coordinates, given by 


$$\mathbf{x}_{2j}^T \mathbf{F} \mathbf{x}_{1j} = 0, \quad j = 1, \ldots, N \tag{4.51}$$







Figure 4.21 Two-view epipolar geometry between converging projective cameras. 
[Figure label: scene (world) coordinate system.] 
It can be shown that the fundamental matrix can be expressed in terms of the 
intrinsic and extrinsic camera parameters as $\mathbf{F} = (\mathbf{K}_2^{-1})^T \mathbf{E} \mathbf{K}_1^{-1}$, where $\mathbf{E} = [\mathbf{t}]_\times \mathbf{R}$ is 
called the essential matrix and 


$$[\mathbf{t}]_\times = \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix}$$


When we do not have access to the internal calibration parameters $\mathbf{K}_1$ and $\mathbf{K}_2$, 
the reconstruction obtained from F will be up to an unknown projective 
transformation $\mathbf{H}^{-1}$ of the actual Euclidean structure. A complete algorithm for two-view 
reconstruction of the 3D projective structure can be summarized as: 


1. Pixel correspondences: Find at least N = 8 pixel correspondences between two 
   views. 
2. Normalization [Har 97]: Normalize the pixel coordinates $\hat{\mathbf{x}}_{ij} = \mathbf{T}_i \mathbf{x}_{ij}$, i = 1, 2, 
   j = 1,...,N, where the affine mappings $\mathbf{T}_i$ are computed as follows: 

   a. Shift the origin of the coordinates to the mean $\bar{\mathbf{x}}_i = \frac{1}{N}\sum_{j=1}^{N} \mathbf{x}_{ij}$, 
      i = 1, 2. 
   b. Scale the shifted pixels $\mathbf{x}_{ij} - \bar{\mathbf{x}}_i$ such that their root-mean-squared 
      distance from the origin is $\sqrt{2}$. 
3. Fundamental matrix: Given normalized correspondences, estimate the 
   fundamental matrix F. 


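   a. Let $\hat{\mathbf{x}}_{ij} = [u_{ij} \; v_{ij} \; w_{ij}]^T$, i = 1, 2, j = 1,...,N; set up an equation $\mathbf{A}\mathbf{f} = \mathbf{0}$, 
      where $\mathbf{f} = [\hat{F}_{11} \; \hat{F}_{12} \; \cdots \; \hat{F}_{33}]^T$ is a 9 × 1 vector containing the entries of the 3 × 3 
      fundamental matrix $\hat{\mathbf{F}}$ for the normalized coordinates, and row j of the N × 9 
      matrix A is of the form 

$$[\,u_{2j}u_{1j} \;\; u_{2j}v_{1j} \;\; u_{2j}w_{1j} \;\; v_{2j}u_{1j} \;\; v_{2j}v_{1j} \;\; v_{2j}w_{1j} \;\; w_{2j}u_{1j} \;\; w_{2j}v_{1j} \;\; w_{2j}w_{1j}\,]$$

      We note that the rank of matrix A must be 8 for a unique solution to exist. 
   b. The non-trivial solution to this set of homogeneous equations can be 
      found by solving 

$$\min_{\mathbf{f}} \|\mathbf{A}\mathbf{f}\| \quad \text{subject to} \quad \|\mathbf{f}\| = 1$$

      The solution is the eigenvector of $\mathbf{A}^T\mathbf{A}$ corresponding to the smallest 
      eigenvalue (see Appendix D). 
   c. The fundamental matrix F before normalization is given by $\mathbf{F} = \mathbf{T}_2^T \hat{\mathbf{F}} \mathbf{T}_1$. 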
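4. Camera projection matrices: Given F, recover camera matrices $\mathbf{P}_1 = [\mathbf{I} \; \mathbf{0}]$ and 
   $\mathbf{P}_2 = [\mathbf{R} \; \mathbf{t}]$. 
   a. Compute the singular value decomposition $\mathbf{F} = \mathbf{U}\mathbf{D}\mathbf{V}^T$. Since F has rank 2, 
      it should have two non-zero singular values, $\mathbf{D} = \mathrm{diag}(a, b, 0)$, in the absence 
      of noise. 
   b. Observe that [Har 04] 

$$\mathbf{F} = (\mathbf{U}\mathbf{Z}\mathbf{U}^T)(\mathbf{U}\mathbf{Y}\mathbf{D}'\mathbf{V}^T) = \mathbf{S}\mathbf{M},$$

      where 

$$\mathbf{Z} = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad \mathbf{Y} = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix},$$

      and 

$$\mathbf{D}' = \mathrm{diag}(a, b, c)$$

      The value of c can be set arbitrarily; a common choice is c = (a + b)/2. 
      Then, the camera matrices are $\mathbf{P}_1 = [\mathbf{I} \; \mathbf{0}]$ and $\mathbf{P}_2 = [\mathbf{M} \; \mathbf{u}_3]$, where $\mathbf{u}_3$ is 
      the third column of U. Note that $\mathbf{u}_3^T \mathbf{F} = \mathbf{0}$, i.e., $\mathbf{u}_3$ is the generator of the 
      left null-space of F. 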


5. Triangulation: The lines joining $\mathbf{x}_{1j}$ and $\mathbf{x}_{2j}$ with their respective camera centers 
   may not intersect in 3D due to estimation inaccuracies; hence, we estimate the 3D 
   points $\mathbf{X}_j$ by solving 

$$\min_{\mathbf{X}_j} \; \|\mathbf{x}_{1j} - \mathbf{P}_1\mathbf{X}_j\|^2 + \|\mathbf{x}_{2j} - \mathbf{P}_2\mathbf{X}_j\|^2$$

   which is robust to noise [Har 04]. This step requires an iterative optimization 
   procedure. 


We discuss the case of 3D Euclidean reconstruction when the cameras are 
calibrated, i.e., $\mathbf{K}_1$ and $\mathbf{K}_2$ are known, in Section 4.8.4. 
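
Steps 2 through 4 of the two-view algorithm can be sketched as follows. This is an illustrative NumPy sketch, not a reference implementation: the function names are hypothetical, the epipolar constraint is taken in the form x2ᵀ F x1 = 0, and an explicit rank-2 projection of the estimate is added as an assumption (a common extra step that is not listed above). The matrix Y is one choice that satisfies the factorization F = SM; triangulation (step 5) is omitted.

```python
import numpy as np

def normalize_points(x):
    """Map (N, 2) pixels to homogeneous points with zero mean and RMS distance sqrt(2)."""
    mean = x.mean(axis=0)
    rms = np.sqrt(((x - mean) ** 2).sum(axis=1).mean())
    s = np.sqrt(2.0) / rms
    T = np.array([[s, 0.0, -s * mean[0]],
                  [0.0, s, -s * mean[1]],
                  [0.0, 0.0, 1.0]])
    xh = np.column_stack([x, np.ones(len(x))])
    return xh @ T.T, T

def eight_point(x1, x2):
    """Estimate F (unit Frobenius norm) with x2^T F x1 = 0 from N >= 8 pairs."""
    n1, T1 = normalize_points(x1)
    n2, T2 = normalize_points(x2)
    # Row j holds the products x2_a * x1_b in the order of f = [F11 F12 ... F33].
    A = np.einsum('ni,nj->nij', n2, n1).reshape(-1, 9)
    _, _, Vt = np.linalg.svd(A)
    F_hat = Vt[-1].reshape(3, 3)
    # Assumption: project onto rank 2 by zeroing the smallest singular value.
    U, d, Vt = np.linalg.svd(F_hat)
    F_hat = U @ np.diag([d[0], d[1], 0.0]) @ Vt
    F = T2.T @ F_hat @ T1                      # undo the normalization
    return F / np.linalg.norm(F)

def cameras_from_F(F):
    """Projective camera pair P1 = [I 0], P2 = [M u3] consistent with F."""
    U, d, Vt = np.linalg.svd(F)
    a, b = d[0], d[1]
    Y = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    M = U @ Y @ np.diag([a, b, 0.5 * (a + b)]) @ Vt
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([M, U[:, 2:3]])             # u3 spans the left null-space of F
    return P1, P2
```

The normalization matters in practice because pixel coordinates of a few hundred units make the columns of A differ by several orders of magnitude, which degrades the linear solution.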
Multi-View Reconstruction — Projective Factorization 


Often we have more than two views of a scene and would like to recover all camera 
matrices and 3D structure of feature points at once. To this effect, the Tomasi-Kanade 
factorization for affine reconstruction has been extended to projective 
reconstruction by Sturm and Triggs [Stu 96] and others [Ole 07]. The projection equations (4.6) 
for M views and N feature points can be written in vector-matrix form as 


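$$\mathbf{O} = \begin{bmatrix} \lambda_{11}\mathbf{x}_{11} & \lambda_{12}\mathbf{x}_{12} & \cdots & \lambda_{1N}\mathbf{x}_{1N} \\ \lambda_{21}\mathbf{x}_{21} & \lambda_{22}\mathbf{x}_{22} & \cdots & \lambda_{2N}\mathbf{x}_{2N} \\ \vdots & \vdots & & \vdots \\ \lambda_{M1}\mathbf{x}_{M1} & \lambda_{M2}\mathbf{x}_{M2} & \cdots & \lambda_{MN}\mathbf{x}_{MN} \end{bmatrix}_{3M \times N} = \begin{bmatrix} \mathbf{P}_1 \\ \mathbf{P}_2 \\ \vdots \\ \mathbf{P}_M \end{bmatrix}_{3M \times 4} \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 & \cdots & \mathbf{X}_N \end{bmatrix}_{4 \times N} \tag{4.52}$$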


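where the scale factors $\lambda_{ij}$ are called projective depths. Note that it is required that 
every point is visible in every view (no occlusions), so the observation matrix O has 
no missing entries. We can express the matrix O as the Hadamard (element-wise) 
product ⊙ of two matrices, where each $\lambda_{ij}$ multiplies all three components of $\mathbf{x}_{ij}$: 


$$\mathbf{O} = \boldsymbol{\Lambda} \odot \mathbf{W}, \quad \text{where} \quad \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_{11} & \lambda_{12} & \cdots & \lambda_{1N} \\ \lambda_{21} & \lambda_{22} & \cdots & \lambda_{2N} \\ \vdots & \vdots & & \vdots \\ \lambda_{M1} & \lambda_{M2} & \cdots & \lambda_{MN} \end{bmatrix} \quad \text{and} \quad \mathbf{W} = \begin{bmatrix} \mathbf{x}_{11} & \mathbf{x}_{12} & \cdots & \mathbf{x}_{1N} \\ \mathbf{x}_{21} & \mathbf{x}_{22} & \cdots & \mathbf{x}_{2N} \\ \vdots & \vdots & & \vdots \\ \mathbf{x}_{M1} & \mathbf{x}_{M2} & \cdots & \mathbf{x}_{MN} \end{bmatrix}$$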

The projective factorization aims to solve for the unknown projective depths $\boldsymbol{\Lambda}$, 
the unknown cameras P, and the unknown structure X simultaneously. Inspection 
of the right-hand side of (4.52) reveals that the matrix O must have rank 4. However, 
since the projective depths $\lambda_{ij}$ are unknown, this rank-4 property is not satisfied if 
they are set arbitrarily; hence, we need an accurate initialization of the projective depths. 
Sturm and Triggs [Stu 96] proposed an iterative algorithm to solve the problem by 
alternation: i) given $\boldsymbol{\Lambda}$, solve for P and X by SVD factorization; ii) given P and X, solve 
for $\boldsymbol{\Lambda}$ by least-squares fitting; and alternate between these two steps until 
convergence. In their original paper, Sturm and Triggs initialize projective depths from 
successive two-view reconstructions. Several generalizations of the original method share 
the following steps: 


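1. Form the homogeneous observation matrix W. Initialize $\boldsymbol{\Lambda}^{(0)}$ such that all 
   $\lambda_{ij}^{(0)} = 1$. 
2. Repeat for k = 0,...,N. 

   a. Compute the closest rank-4 matrix $\mathbf{O}^{(k)}$ to $\boldsymbol{\Lambda}^{(k)} \odot \mathbf{W}$ by singular value 
      decomposition. Let $\boldsymbol{\Lambda}^{(k)} \odot \mathbf{W} = \mathbf{U}\mathbf{D}\mathbf{V}^T$. Define $\hat{\mathbf{D}}$ to be the diagonal matrix 
      obtained by keeping the first four (largest) diagonal entries of D and setting 
      the rest equal to zero. Then, $\mathbf{O}^{(k)} = \mathbf{U}\hat{\mathbf{D}}\mathbf{V}^T$. 

   b. If k = N, go to step 3. Otherwise, compute a new matrix $\boldsymbol{\Lambda}^{(k+1)}$ of weights 
      $\lambda_{ij}^{(k+1)}$ so that $\boldsymbol{\Lambda}^{(k+1)} \odot \mathbf{W}$ is as close as possible to $\mathbf{O}^{(k)}$ under the Frobenius 
      norm. 

3. Compute the factorization $\mathbf{O}^{(N)} = \mathbf{P}\mathbf{X}$ to obtain P and X that provide the 
   camera matrices and point locations, respectively. 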


It has been noted that the Sturm—Triggs method with the above initialization 
works well if the true solution is close to its affine approximation [Har 04]. Dai et al. 
[Dai 10] proposed a non-iterative element-wise factorization that can also handle 
missing data in the measurement matrix. 
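
The alternation in steps 1 through 3 can be sketched in NumPy as below. The function names, the array layout (an M × N × 3 array of homogeneous image points), and the returned residual history are assumptions made for illustration; the per-entry depth update is the closed-form least-squares fit of a single scalar, and the final split of the rank-4 matrix into P and X is one of infinitely many projectively equivalent choices.

```python
import numpy as np

def stack_views(lam, x):
    """Build the 3M x N matrix whose (i, j) block is lam[i, j] * x[i, j]."""
    M, N, _ = x.shape
    return (lam[:, :, None] * x).transpose(0, 2, 1).reshape(3 * M, N)

def projective_factorization(x, n_iter=10):
    """x: (M, N, 3) homogeneous image points. Returns P (3M x 4), X (4 x N),
    the projective depths (M x N), and the rank-4 fitting residual per iteration."""
    M, N, _ = x.shape
    lam = np.ones((M, N))                       # step 1: all depths set to 1
    history = []
    for k in range(n_iter + 1):
        O_in = stack_views(lam, x)
        U, d, Vt = np.linalg.svd(O_in, full_matrices=False)
        O_k = (U[:, :4] * d[:4]) @ Vt[:4]       # step 2a: closest rank-4 matrix
        history.append(np.linalg.norm(O_in - O_k))
        if k == n_iter:
            break
        # Step 2b: for each (i, j), the scalar minimizing ||lam * x_ij - O_ij||.
        blocks = O_k.reshape(M, 3, N).transpose(0, 2, 1)
        lam = (blocks * x).sum(axis=2) / (x * x).sum(axis=2)
    P = U[:, :4] * d[:4]                        # step 3: O = P X
    X = Vt[:4]
    return P, X, lam, history
```

Each half-step minimizes the same Frobenius distance over one set of variables, so the residual never increases. Without an additional normalization of the depths, however, the iteration can drift toward degenerate solutions with shrinking depths, so practical implementations typically rescale the rows and columns of Λ between iterations.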


Bundle Adjustment 


Since the observed feature-point correspondences $\mathbf{x}_{ij}$ in the two-view and 
multi-view problems have limited accuracy, the projection equations $\lambda_{ij}\mathbf{x}_{ij} = \mathbf{P}_i\mathbf{X}_j$ cannot 
be satisfied exactly. Therefore, the projection matrices $\mathbf{P}_i$ and 3D points $\mathbf{X}_j$ estimated 
by using the simple linear estimation procedures discussed above may contain arbitrary 
errors. The maximum likelihood (ML) estimation of $\mathbf{P}_i$ and $\mathbf{X}_j$, under the 
assumption that measurement errors are Gaussian and independent of each other, requires 
the solution of the following nonlinear least-squares problem: 


$$\min_{\mathbf{P}_i, \mathbf{X}_j} \sum_{i} \sum_{j} \left\| \mathbf{x}_{ij} - \mathbf{P}_i\mathbf{X}_j \right\|^2 \tag{4.53}$$

which is generally referred to as the “bundle adjustment” method. A popular iterative 
algorithm to minimize this cost function is the Levenberg-Marquardt method 
[Tri 99, Har 04]. In order to converge to the globally optimal solution, it requires a good 
initial solution, which can be found by applying the above two-view or multi-view 
solutions. 
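
To make the structure of this minimization concrete, the sketch below applies a Levenberg-Marquardt iteration to a reduced version of the problem: the cameras are held fixed and only one 3D point is refined from its observations. This is a simplification chosen for brevity, not full bundle adjustment; the function names, the forward-difference Jacobian, and the damping schedule are all assumptions. The residual uses the dehomogenized projection of the point, which is how the reprojection error is evaluated in pixels.

```python
import numpy as np

def project(P, X):
    """Project an inhomogeneous 3D point X (3,) with a 3x4 camera P to pixels (2,)."""
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]

def residuals(X, cams, obs):
    """Stacked reprojection errors of X over all views."""
    return np.concatenate([project(P, X) - x for P, x in zip(cams, obs)])

def refine_point_lm(X0, cams, obs, n_iter=50, mu=1e-3):
    """Minimize the sum of squared reprojection errors over one 3D point."""
    X = np.asarray(X0, dtype=float)
    r = residuals(X, cams, obs)
    h = 1e-6                                    # finite-difference step (assumed)
    for _ in range(n_iter):
        J = np.empty((r.size, 3))
        for k in range(3):
            dX = np.zeros(3)
            dX[k] = h
            J[:, k] = (residuals(X + dX, cams, obs) - r) / h
        H = J.T @ J
        # Damped normal equations: small mu ~ Gauss-Newton, large mu ~ gradient descent.
        step = np.linalg.solve(H + mu * np.diag(np.diag(H)), -J.T @ r)
        r_new = residuals(X + step, cams, obs)
        if r_new @ r_new < r @ r:
            X, r, mu = X + step, r_new, mu * 0.1   # accept and trust the model more
        else:
            mu *= 10.0                             # reject and damp more strongly
    return X
```

In full bundle adjustment the same update is applied jointly to all camera and point parameters, and the sparsity of the Jacobian is exploited to keep the normal equations tractable.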


4.8.4 Euclidean Reconstruction 


The classical approach to Euclidean reconstruction requires knowledge of the 
camera-calibration matrices. In the two-view case, given the calibration matrices $\mathbf{K}_1$ and $\mathbf{K}_2$, 
we can first compute 


$$\hat{\mathbf{x}}_{ij} = \mathbf{K}_i^{-1}\mathbf{x}_{ij}, \quad i = 1, 2 \tag{4.54a}$$


Then, Eqn. (4.51) can be rewritten as $\hat{\mathbf{x}}_{2j}^T \mathbf{K}_2^T \mathbf{F} \mathbf{K}_1 \hat{\mathbf{x}}_{1j} = 0$, which can be 
re-expressed as 


$$\hat{\mathbf{x}}_{2j}^T \mathbf{E} \hat{\mathbf{x}}_{1j} = 0, \quad j = 1, \ldots, N \tag{4.54b}$$

Hence, the essential matrix E and the 3D Euclidean reconstruction (up to a 
global scale) can be recovered from the scaled pixel correspondences (4.54a) directly 
using the procedure described in Section 4.8.3. 
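
The change of coordinates in (4.54a) and the relation between F and E can be checked numerically with a short sketch. The helper names are hypothetical, and the final projection onto two equal singular values is an added assumption (a standard way to enforce the structure of an essential matrix when F is estimated from noisy data).

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def calibrate_points(x, K):
    """Apply (4.54a): map (N, 2) pixels to homogeneous coordinates K^{-1} x."""
    xh = np.column_stack([x, np.ones(len(x))])
    return np.linalg.solve(K, xh.T).T

def essential_from_fundamental(F, K1, K2):
    """E = K2^T F K1, projected onto singular values (s, s, 0)."""
    E = K2.T @ F @ K1
    U, d, Vt = np.linalg.svd(E)
    s = 0.5 * (d[0] + d[1])
    return U @ np.diag([s, s, 0.0]) @ Vt
```

Because F is defined only up to scale, so is E; the length of the translation, and hence the global scale of the reconstruction, cannot be recovered from the correspondences alone.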

Alternatively, in the stratification approach [Fau 95], a projective reconstruction 
is first computed from uncalibrated pixel correspondences, and is then upgraded to 
an affine or a Euclidean one by incorporating ground-truth or camera-calibration 
information. This can be implemented in different ways given different information: 


1. Camera calibration: Projective ambiguity can be resolved up to a global-scale 
factor given the camera matrices. 

2. Partial camera calibration: Constraints, such as known focal length, or 
that cameras have the same internal parameters, may be enforced during 
bundle-adjustment. 

3. Auto-calibration: Automatic methods for auto-calibration often compute an 
affine reconstruction first, followed by an update to a Euclidean reconstruction 
and full determination of the camera-calibration parameters [Har 04]. 

4. Ground-truth 3D coordinates: Knowledge of 3D Euclidean coordinates of at 
least five ground-control points is required [Har 04]. 
4.8.5 Planar-Parallax and Relative Affine Structure Reconstruction 


The planar-parallax framework, introduced in Section 4.2.2, enables reconstruction 
of a dense scene structure (height of pixels) relative to a reference plane in the scene 
from two or multiple frames. First, a planar surface that is visible in all views is 
identified. Then, homography between each view and a reference view is estimated 
successively (Section 4.5.4), and all views are registered with respect to the reference 
view to compute the residual (planar parallax) motion fields at each frame. Here, we 
discuss direct estimation of the unknown motion parameters, which are the epipoles 
(one for each frame) and dense structure parameters, i.e., the height of each pixel 
with respect to reference plane at every frame, in the residual motion model (4.7) 
from uncalibrated cameras. The iterative direct-estimation method is based on 
“the epipolar brightness constraint” [Ira 02], which is obtained by substituting the 
components of the planar-parallax flow (4.7) (assuming the time interval between 
two frames is one) 


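$$v_1 = x_1^w - x_1 = \frac{\gamma(\mathbf{x},k)}{(1+\gamma(\mathbf{x},k))\,t_{3,k}}\,(t_{3,k}x_1 - t_{1,k})$$

$$v_2 = x_2^w - x_2 = \frac{\gamma(\mathbf{x},k)}{(1+\gamma(\mathbf{x},k))\,t_{3,k}}\,(t_{3,k}x_2 - t_{2,k})$$


into the optical flow equation (4.24), relating frame $s(\mathbf{x},k)$ to the reference frame: 


$$\frac{\gamma(\mathbf{x},k)}{(1+\gamma(\mathbf{x},k))\,t_{3,k}} \left[ \frac{\partial s}{\partial x_1}(t_{3,k}x_1 - t_{1,k}) + \frac{\partial s}{\partial x_2}(t_{3,k}x_2 - t_{2,k}) \right] + \Delta_k(s(\mathbf{x},k)) = 0 \tag{4.55}$$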


where 


ðs (x) , ðs (x), 
x; w 7 X2 w 
> a OR 





AGER =) 


We have one constraint equation for each pixel for a total of LN equations, but 
there are a total of L(N + 3) unknowns, where L and N denote the number of frames 
and the number of pixels per frame, respectively. Hence, we require a local smoothness 
constraint on the shape parameters, whereby γ is assumed to be constant within a 
local (5 × 5) window B (see step 3a below) about each pixel, so that the problem is not 
under-constrained. The complete algorithm can be summarized as follows [Ira 02]: 


1. Construct a multi-resolution pyramid for each frame $s(\mathbf{x},k)$ and the reference 
   frame s(x). 

2. Start with the coarsest resolution level. Initialize structure parameters 
   $\gamma^{(0)}(x_1,x_2,k) = 0$ for all $(x_1,x_2)$ at each frame k and the motion parameter 
   $\mathbf{t}_k^{(0)} = [0 \; 0 \; 1]^T$ for each frame k. 

3. Refine structure and motion parameters iteratively at the current resolution 
   level. Set j = 1. 

   a. For all frames k, given the epipole $\mathbf{t}_k^{(j-1)}$, estimate the structure for each pixel 
      $(x_1,x_2)$ by minimizing: 

$$E\big(\gamma^{(j)}(x_1,x_2)\big) = \sum_k \sum_{(\tilde{x}_1,\tilde{x}_2) \in B(x_1,x_2)} \left[ \gamma\frac{\partial s}{\partial x_1}(t_{3,k}\tilde{x}_1 - t_{1,k}) + \gamma\frac{\partial s}{\partial x_2}(t_{3,k}\tilde{x}_2 - t_{2,k}) + (1+\gamma)\,t_{3,k}\,\Delta_k(s(\mathbf{x},k)) \right]^2$$

   b. Given the scene structure parameters for all pixels, estimate the position 
      of the epipole for each frame with respect to the reference frame. For each 
      epipole $\mathbf{t}_k$, minimize 

$$E(\mathbf{t}_k) = \sum_{(x_1,x_2)} \left[ W_k \left( \gamma\frac{\partial s}{\partial x_1}(t_{3,k}x_1 - t_{1,k}) + \gamma\frac{\partial s}{\partial x_2}(t_{3,k}x_2 - t_{2,k}) + (1+\gamma)\,t_{3,k}\,\Delta_k(s(\mathbf{x},k)) \right) \right]^2$$

      where $W_k$, γ, and the partials are functions of $(x_1,x_2)$ and the weights are 
      $W_k = 1/[(1+\gamma)\,t_{3,k}]$. 
   c. Set j = j + 1, and repeat (a) and (b) five times at each resolution level. 
4. Repeat step 3 at the next higher resolution level (until the original resolution) 
   using parameter estimates from the previous level as initial estimates. 
5. The final output is the structure and motion parameters at the original 
   resolution of input views and the residual parallax-flow field $(v_1, v_2)$ synthesized from 
   them using (4.7). 
The risk of getting stuck at a local minimum is significantly reduced by using 
a coarse-to-fine estimation strategy and the limited search space (three parameters 
per frame for global-motion estimation and one parameter per pixel for local 
correspondence and shape estimation) [Ira 02]. We note that the precision of shape and 
motion estimation relies on accuracy of the alignment of input views with respect to 
the reference planar surface. 
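
The structure update of step 3a has a closed form once the epipole is fixed, because the constraint is linear in γ. The sketch below illustrates this for a single frame under an explicit assumption about the form of the constraint: it takes the per-pixel residual to be γ g + (1 + γ) t3 Δ, where g = (∂s/∂x1)(t3 x1 − t1) + (∂s/∂x2)(t3 x2 − t2) and Δ is the temporal difference term. The function names and argument layout are hypothetical, and the sum over frames is omitted.

```python
import numpy as np

def box_sum(img, radius=2):
    """Sum of img over a (2*radius+1)^2 window around each pixel (edge-padded)."""
    p = np.pad(img, radius, mode='edge')
    H, W = img.shape
    out = np.zeros((H, W))
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out += p[dy:dy + H, dx:dx + W]
    return out

def estimate_gamma(sx1, sx2, delta, x1, x2, t, radius=2, eps=1e-12):
    """Least-squares gamma per pixel, assumed constant over a local window.

    sx1, sx2: spatial partials; delta: temporal difference term;
    x1, x2: pixel coordinate grids; t = (t1, t2, t3): epipole parameters.
    """
    t1, t2, t3 = t
    g = sx1 * (t3 * x1 - t1) + sx2 * (t3 * x2 - t2)
    a = g + t3 * delta          # residual written as gamma * a + b
    b = t3 * delta
    # Minimizer of sum (gamma * a + b)^2 over the window.
    return -box_sum(a * b, radius) / (box_sum(a * a, radius) + eps)
```

The default radius of 2 corresponds to the 5 × 5 window B; the small constant eps only guards against division by zero in textureless windows, where the estimate is unreliable in any case.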

The relative affine-structure framework [Sha 96] is closely related to the planar- 
parallax framework. The magnitude of parallax displacement of points between two 
views relative to a planar surface in the scene is called “the relative affine structure.” 
The relative affine structure depends both on the “height” of a 3D point X from the 
planar surface in the scene (reference plane) and its “depth” relative to the reference 
camera (reference frame). 


4.8.6 Dense Structure from Stereo 


Applications such as image-based rendering of intermediate views for multi-view 
video require dense scene structure (depth image) reconstruction. Dense-structure 
estimation is not always possible with high precision using structure from motion 
algorithms from monocular video since presence of multiple motions by indepen- 
dently moving objects complicates the solution. Structure from stereo (two-views) 
with calibrated cameras is perhaps the most robust approach to reconstructing a 
dense scene structure because the epipolar geometry holds true for all pixels. 
Structure from stereo (SFM) algorithms typically start with camera calibration 
and image rectification. Rectification refers to aligning the image planes of the two 
cameras to be co-planar and parallel to the line joining the two camera centers. This is 
achieved by applying a homography to each image. After image rectification, the 
correspondence (disparity) search is simplified to a 1D search on a horizontal line since all 
epipolar lines are parallel in the rectified image plane. Motion-estimation methods discussed in 
Section 4.5 or Section 4.6 can be used for 1D disparity estimation. A comparative 
evaluation of more than 20 stereo-matching algorithms is given by Scharstein and 
Szeliski [Sch 02]. 
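
The 1D disparity search on rectified images can be illustrated with a basic block-matching sketch. It is a deliberately simple stand-in for the methods of Sections 4.5 and 4.6, with hypothetical function names: for each pixel of the left image it tests integer disparities d = 0,...,max_disp along the same row and keeps the one with the smallest sum of absolute differences (SAD) over a square window. The convention assumed here is that pixel (row, x) in the left image corresponds to pixel (row, x − d) in the right image.

```python
import numpy as np

def box_sum(img, radius):
    """Sum of img over a (2*radius+1)^2 window around each pixel (edge-padded)."""
    p = np.pad(img, radius, mode='edge')
    H, W = img.shape
    out = np.zeros((H, W))
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out += p[dy:dy + H, dx:dx + W]
    return out

def disparity_map(left, right, max_disp=16, radius=2):
    """Integer disparity per pixel by exhaustive 1D SAD search on rectified images."""
    left = left.astype(float)
    right = right.astype(float)
    H, W = left.shape
    best_cost = np.full((H, W), np.inf)
    disp = np.zeros((H, W), dtype=int)
    for d in range(max_disp + 1):
        # shifted[row, x] = right[row, x - d]; the left border is replicated.
        shifted = np.empty_like(right)
        shifted[:, d:] = right[:, :W - d]
        if d > 0:
            shifted[:, :d] = right[:, :1]
        cost = box_sum(np.abs(left - shifted), radius)
        better = cost < best_cost
        best_cost[better] = cost[better]
        disp[better] = d
    return disp
```

For calibrated, rectified cameras the depth of a pixel is inversely proportional to its disparity, so the disparity map can be converted to a dense depth image once the focal length and baseline are known.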
Exercises 
Problem Set 4 


4.1 Discuss the conditions under which the weak-perspective projection provides 
a good approximation to imaging through an ideal pinhole camera. 


4.2 Homography: 
a. Derive the homography equation (4.14) for the case of relative 3D 
rigid motion between a planar surface and a camera under perspective 
projection. 

b. Show that Eqn. (4.14) is also valid for imaging an arbitrary rigid surface 
when a camera is only allowed to rotate about its x-, y-, and/or z-axis (no 
camera translation). 


4.3 Show that the model (4.18) can be obtained by linearization of the 
homography given by (4.14). 


4.4 What are the conditions for the existence of normal flow given by Eqn. (4.25)? 
Can we recover optical flow vectors from the normal flow? Discuss the 
relationship between the spatial-image gradients and the aperture problem. 


4.5 For a color image, the optical-flow equation (4.23) can be written for each 
of the R, G, and B channels separately. State the conditions on the (R, G, B) 
intensities so we have at least two linearly independent equations at each 
pixel. How valid are these conditions for general color images? 


4.6 Derive the optical-flow equation (4.23) starting from the displaced frame 
difference equation (4.26). Why do we need the small motion assumption for 
the optical-flow equation to be valid? 


4.7 Differential motion-estimation methods can estimate small displacements. 
Why? Explain how hierarchical-motion estimation and the iterative-refinement 
scheme (see Section 4.4.1) help deal with larger MVs when using the 
optical-flow equation for motion estimation. 


4.8 Explain how hierarchical-motion estimation helps to alleviate the aperture 
problem in motion estimation. 


4.9 How can we detect occlusion areas in forward- and backward-motion 
estimation? Discuss. 


4.10 Derive equations for direct estimation of affine-motion model parameters 
from image-intensity gradients without using point correspondences. 


4.11 Derive equations for direct estimation of homography from image-intensity 
gradients without using point correspondences. (Hint: see 
Appendix in [Bak 04].) 


4.12 Discuss how coordinate normalization should be implemented to obtain a 
stable solution for Eqn. (4.41). (Hint: See [Har 97].) 


4.13 In the phase correlation motion-estimation method: 

a. Discuss why we observe a peak rather than an impulse when we compute 
Eqn. (4.48). 

b. Discuss why the range of displacement estimates is limited to 
[(−N/2) + 1, N/2], given the DFT size is N × N, for N even. 


MATLAB Exercises 


4.1 Dense-motion Estimation — Optical Flow 


a. Write a MATLAB program to compute the spatial partials $\partial s/\partial x_1$ and 
$\partial s/\partial x_2$ of an image 
i. using finite differences, 
ii. using the derivative of the Gaussian filter. 

b. Write a MATLAB program to find the Lucas-Kanade motion vectors 
between two frames using 8 × 8 non-overlapping blocks and the 
block-translation motion model. (Note: Do not forget to put some control 
statement in your program that checks the rank or the condition number of the 
2 × 2 matrix before inversion.) 

i. Which frame do you use to compute the spatial partials? Explain. 

ii. Display calculated motion vectors. Are there any outlier motion vectors? 
Discuss. 

iii. Warp the reference frames toward the current frame using the computed 
motion vectors. Display the motion-compensated frame difference. 
Comment on the result. 

c. Repeat (b) using a three-level hierarchical representation of each frame and 
computing four iterations at each level of the hierarchy. Compare results 
with those in 2(b) and 2(c), and discuss the quality of the results. 


4.2 Block-Matching Motion Estimation 


a. Design and implement the full-search motion estimation and at least one 
fast-motion estimation (preferably logarithmic search) method. Compare 
the computational load as well as the quality of both motion estimation 
algorithms. 

b. Repeat using three-level hierarchical representation of each frame. Note 
that motion vectors from the lower resolution level multiplied by 2 in each 
dimension will serve as initial estimates for the search at a higher resolution 
level. Discuss the advantages of hierarchical block-matching. 

Your report should include a thorough discussion of your implementation 
and results. You should also provide a visual representation of motion 
vectors. The comparison of the full-search motion estimation and a 
fast-motion estimation method of your choice must include MAD or MSE 
values and a comparison of computational load (number of multiplication, 
addition, and compare operations). 


4.3 Parametric-Motion Estimation and Image Registration 


Given two images of a distant static scene: 


a. Affine Motion Model: Compute the spatial partials $\partial s/\partial x_1$ and $\partial s/\partial x_2$ and 
the temporal partial $\partial s/\partial t$ as in Problem 4.1. 

i. Form the matrix (4.38) over the region of overlap between two images 
and solve Eqn. (4.39) to calculate six affine parameters to register the 
two images. 

ii. Use the calculated parameters to warp the current image toward the 
reference image using the imtransform function in MATLAB to create a 
panoramic image. 

iii. How do you combine pixel intensities in the overlapping region 
between the two frames? 

b. Pseudo-Perspective Motion Model: Repeat part (a), this time using an 
eight-parameter pseudo-perspective motion model in creating an image 
mosaic of the scene. Compare the results of (a) and (b), and write your 
conclusions in the report. 

Your report should include the two images stitched together (the mosaic 
representation) obtained by both methods and a comprehensive discussion 
of how you evaluate which method provides a better result. 


4.4 Homography Estimation by Feature Matching 
Two images are related by a homography if they are images of a static scene 
captured by rotating the camera about a fixed center of projection possibly 
including zoom (but no camera translation) or they are images of a planar 
object/scene. Capture at least two images satisfying these requirements. 


a. Select or compute a set of N > 10 putative-point correspondences between 
two images. 

b. Normalize feature correspondences using a similarity transformation (see 
Section 4.5.5). 

c. Estimate the homography between two images using the singular value 
decomposition (SVD) method. 

d. Warp one image onto the other using the estimated transformation using 
the maketform and imtransform functions in MATLAB. 

e. You can now stitch these images together to form a mosaic as in the 
previous exercise. 


4.5 Estimation of Fundamental Matrix and Triangulation 
The fundamental matrix captures the epipolar geometry between two 
projective views in algebraic form. Capture at least two images of a static 3D object. 

a. Use a test target to obtain the camera calibration matrix K. How do you 
validate the calibration parameters? 

b. Select or compute a set of N > 10 putative point (on the object) 
correspondences between two images. 

c. Estimate the fundamental matrix and the projective camera matrices $\mathbf{P}_1$ 
and $\mathbf{P}_2$ for the two views. How do you validate the camera matrices? 

d. Estimate the 3D geometry of the feature points in the projective coordinates. 

e. Estimate the 3D Euclidean geometry of the feature points. Your report 
should include the camera calibration parameters, camera matrices, and 3D 
plots of both the projective and Euclidean geometry of the feature points. 


MATLAB Resources 


Michael J. Black, Optical Flow Software (C and MATLAB) 
http://cs.brown.edu/~black/code.html 


Jean-Yves Bouguet, Camera Calibration Toolbox for Matlab 
http://www.vision.caltech.edu/bouguetj/calib_doc/ 


A. Zisserman, MATLAB Functions for Multiple View Geometry 
http://www.robots.ox.ac.uk/~vgg/hzbook/code 


Peter Kovesi, Model Fitting and Robust Estimation 
http://www.csse.uwa.edu.au/~pk/Research/MatlabFns/index.html#robust 


CHAPTER 5 


Video Segmentation and Tracking 





Video segmentation refers to partitioning images (frames) or video into spatial, 
temporal, or spatio-temporal regions that are homogeneous in some feature space. 
As with any segmentation problem, effective video segmentation requires proper 
feature selection and an appropriate distance measure. Temporal-segmentation 
partitions video into shots based on similarity of frames. Spatial-segmentation 
partitions each video frame into homogeneous regions, which can be achieved by 
segmenting each frame individually based on color similarity (intra-frame image 
segmentation). Spatio-temporal segmentation yields temporally connected spatial 
segments or object trajectories, for example, by inter-frame segmentation (based on 
similarity of color and motion between frames) or color and motion tracking. 


Spatial segmentation helps identify foreground and background objects, as well as 
color and motion boundaries and occlusion regions. Spatio-temporal segmentation 
aims to obtain temporally linked spatial regions (objects) over long video sequences 
(multiple frames). Motion or object tracking is a popular and important special case 
of spatio-temporal segmentation, where a specific object is segmented in space-time 
by causal processing, i.e., given the segmentation map at frame k, find the map at 
frame k+1 that is consistent with the result at frame k. Tracking methods may be 
based on color, motion, or color and motion. Video segmentation is an integral part 
of many video analysis and coding tasks, including: i) advanced image/video coding 
and rate allocation, ii) improved motion estimation, iii) 3D motion and structure 
estimation with multiple objects, iv) video surveillance/understanding, v) video 
indexing and summarization [Dim 02], and vi) video authoring and editing. 

Different features and homogeneity criteria may lead to different segmentations 
of the same video, e.g., color, texture, or motion segmentation [Cas 98]. Further- 
more, there is no guarantee that any of the resulting automatic segmentations will be 
semantically meaningful, since a semantically meaningful region may have multiple 
colors, multiple textures, and/or multiple motions. Although semantic objects can 
be computed automatically in some well-constrained settings, e.g., when an object 
moves against a stationary background, in general semantic object segmentation 
requires specialized capture methods (chroma-keying) or user interaction. Specific 
video-segmentation methods should be considered in the context of the require- 
ments of the application in which they are used. Factors that affect the choice of a 
specific segmentation method include [Cor 04]: 


• Precision of segmentation: If segmentation is employed to improve compression 
efficiency or rate control, then some misalignment between segmentation results 
and actual object borders may not be of great concern. On the other hand, if 
segmentation is needed for object-based video authoring/editing or shape-similarity 
matching, then it is of utmost importance that the estimated boundaries align with 
the actual object boundaries, where even a single-pixel error may not be acceptable. 

• Complexity of content: Complexity of content can be modeled in terms of 
amount of camera motion, color and texture uniformity, contrast between 
objects, smoothness of motion, objects entering and leaving the scene, regular- 
ity of object shape along the temporal dimension, frequency of cuts and special 
effects, etc. Clearly, more complex scenes require more sophisticated segmenta- 
tion algorithms, e.g., it is easier to detect cuts than wipes or fades. 

• Real-time performance: If segmentation must be performed in real-time, e.g., 
for rate control in videoconferencing, then fully automatic algorithms must 
be used. On the other hand, one can employ semi-automatic, interactive algo- 
rithms for off-line applications such as video editing, indexing, or off-line video 
coding, or to obtain semantically meaningful segmentations [Izq 02]. 


This chapter presents several video-segmentation methods ranging from 
simple shot-boundary and motion-detection techniques to sophisticated 
motion-segmentation, interactive object segmentation, and tracking methods. 
Although multi-modal signal-processing methods have been shown to be effec- 
tive in some applications [Wan 00], here we only cover video modality. We start 
with image-segmentation methods in Section 5.1. Scene change (shot boundary) 
detection for temporal-video segmentation and motion-detection/background 
subtraction methods, where we study both frame differencing and pixel-based 
multi-frame background-modeling methods, are covered in Section 5.2. Motion- 
segmentation methods are discussed in Section 5.3. We begin with the dominant- 
motion approach, which labels independently moving regions sequentially (one at 
a time). We then present multiple-motion segmentation methods, including clus- 
tering motion parameters, maximum-likelihood segmentation, maximum a poste- 
riori probability segmentation, and region-labeling methods. Simultaneous motion 
estimation and segmentation is discussed next, since accuracy of segmentation 
depends on the accuracy of the estimated motion field and vice versa. A discussion 
of semantically meaningful object segmentation with emphasis on chroma-keying 
and semi-automatic (interactive) object segmentation/tracking methods is also 
included. Section 5.4 provides an overview of object-tracking methods, which can 
be considered as spatio-temporal-object segmentation over long video sequences. 
Image and video matting is discussed in Section 5.5. Finally, performance evalua- 
tion of video-segmentation and tracking methods is treated in Section 5.6. 


5.1 Image Segmentation 


Segmentation groups neighboring pixels in an image or video frame together based 
on similarity of color, texture, and/or shape cues [Har 85, Pal 93]. We start by sim- 
ple thresholding and clustering methods, which do not impose spatial-connectivity 
constraints on segmentation labels, in Section 5.1.1 and Section 5.1.2, respectively. 
The maximum a posteriori probability (MAP) estimation and graph-based methods, 
which do impose spatial-connectivity constraints, are covered in Section 5.1.3 and 
Section 5.1.4, respectively. Finally, we introduce active-contour models in Section 
5.1.5. The reader is referred to [Sal 99, Gon 07] or other books for mathematical 
morphology-based methods. 


5.1.1 Thresholding 


Thresholding is a popular tool for image segmentation [Har 85]. Consider an image 
$s(x_1, x_2)$ composed of a light object on a dark background. Such an image has a 
bi-modal histogram $h(s)$ as depicted in Figure 5.1. An intuitive approach to segment 
the object from the background, based on gray-scale information, is to select a 
threshold T that separates these two dominant modes (peaks). The segmentation of 
an image into two regions is also known as binarization. 

Figure 5.1 Bi-modal histogram. 

The segmentation mask or the binarized image identifies the object (denoted by 1) 
and background (denoted by 0) pixels: 


$$z(x_1, x_2) = \begin{cases} 1 & \text{if } s(x_1, x_2) > T \\ 0 & \text{otherwise} \end{cases} \tag{5.1}$$

Thresholding techniques can be divided into two broad classes: global and local. 
In general, the threshold T is a function $T = T(x_1, x_2, s(x_1, x_2), p(x_1, x_2))$, 
where $p(x_1, x_2)$ is some local property of the pixel, such as the average intensity 
of a local neighborhood. If T is selected based only on the statistics of pixel values 
$s(x_1, x_2)$ over the entire image, it is called a global threshold. Alternatively, T can 
be a global threshold selected based on the statistics of both $s(x_1, x_2)$ and 
$p(x_1, x_2)$ to reflect certain spatial properties of the entire image. If, in addition, 
it depends on $(x_1, x_2)$, it is called a dynamic or adaptive threshold. Adaptive 
thresholds are usually computed based on a local sliding window about $(x_1, x_2)$. 
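
The difference between a global and an adaptive threshold can be sketched in a few lines of Python. The function names, the window radius, the `offset` parameter, and the use of the local mean as the local property $p(x_1, x_2)$ are illustrative assumptions, not choices made in the text.

```python
def global_threshold(img, T):
    """Binarize with one threshold T for the whole image, as in (5.1)."""
    return [[1 if s > T else 0 for s in row] for row in img]


def adaptive_threshold(img, radius=1, offset=0.0):
    """Binarize with T(x1, x2) = local mean - offset over a sliding window.

    The local mean and the `offset` parameter are illustrative choices.
    """
    rows, cols = len(img), len(img[0])
    mask = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Sliding window about (r, c), clipped at the image border.
            r0, r1 = max(0, r - radius), min(rows, r + radius + 1)
            c0, c1 = max(0, c - radius), min(cols, c + radius + 1)
            window = [img[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            T = sum(window) / len(window) - offset
            mask[r][c] = 1 if img[r][c] > T else 0
    return mask
```

A global threshold needs a single pass over the histogram, whereas the adaptive version recomputes T at every pixel and can therefore follow slow illumination changes across the image.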


Finding the Optimum Threshold(s) 


Several threshold-determination methods have been proposed, which are discussed 
in a number of review papers [Lee 90, Sez 04]. Perhaps the most popular 
threshold-determination method for a bi-modal histogram $h(s)$ (where we have two 
classes of pixels in an image $s(x_1, x_2)$, e.g., foreground and background) is the Otsu 
method [Ots 79], which finds the optimum threshold separating the two classes so 
that the within (intra)-class variance is minimal and the between (inter)-class variance 
is maximal. 
The within-class variance $\sigma_w^2$ as a function of the threshold $k$ is defined by a 
probability-weighted sum of the variances of the two classes, $\sigma_1^2$ and $\sigma_2^2$, given by 


$$\sigma_w^2(k) = P_1(k)\,\sigma_1^2 + P_2(k)\,\sigma_2^2 \tag{5.2}$$


where $P_1(k) = \sum_{i=0}^{k} p(i)$ is the probability of the first class when the two classes 
are separated by a threshold $k$, $p(i)$ is the value of the $i$th bin of the normalized 
histogram, and $P_2(k) = 1 - P_1(k)$. Otsu shows that minimizing the within-class 
variance is equivalent to maximizing the between-class variance: 


$$\sigma_B^2(k) = \sigma^2 - \sigma_w^2(k) = P_1(k)\,P_2(k)\,[\mu_1(k) - \mu_2(k)]^2 \tag{5.3}$$


where $\sigma^2$ is the combined variance, and the class means $\mu_1(k)$ and $\mu_2(k)$ are 


$$\mu_1(k) = \frac{\sum_{i=0}^{k} i\,p(i)}{P_1(k)}, \qquad \mu_2(k) = \frac{\sum_{i=k+1}^{L-1} i\,p(i)}{P_2(k)}$$

and $L-1$ is the maximum gray level in the image. The Otsu algorithm, which 
performs a brute-force search, can be summarized as follows. 


Otsu-Threshold Algorithm [Ots 79] 


1. Compute the normalized histogram $p(i)$, $i = 0, \ldots, L-1$, with L levels. 
2. Step through all possible thresholds $k = 0, \ldots, L-1$: 


a. Compute $P_1(k)$, $P_2(k)$, $\mu_1(k)$, and $\mu_2(k)$. 
b. Compute $\sigma_B^2(k) = P_1(k)\,P_2(k)\,[\mu_1(k) - \mu_2(k)]^2$. 


3. The Otsu threshold is given by $k^* = \arg\max_{0 \le k \le L-1} \sigma_B^2(k)$. If the maximum is 
not unique, then $k^*$ is given by the average of the $k$ values corresponding to the 
multiple maxima. 
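
The brute-force search above can be sketched in plain Python as follows. The function name `otsu_threshold`, the tolerance used to detect tied maxima, and the decision to skip thresholds that leave one class empty (where a class mean is undefined) are assumptions of this sketch.

```python
def otsu_threshold(hist):
    """Return the Otsu threshold k* for a histogram with L bins.

    Gray levels <= k* form the first class, the remaining levels the second.
    """
    total = float(sum(hist))
    p = [h / total for h in hist]          # step 1: normalized histogram p(i)
    L = len(p)
    mean_total = sum(i * p[i] for i in range(L))

    best_var, best_ks = -1.0, []
    P1 = 0.0       # P_1(k), accumulated over k
    m1_sum = 0.0   # sum_{i<=k} i p(i), accumulated over k
    for k in range(L):                     # step 2: all candidate thresholds
        P1 += p[k]
        m1_sum += k * p[k]
        P2 = 1.0 - P1
        if P1 < 1e-12 or P2 < 1e-12:
            continue                       # one class is empty: mean undefined
        mu1 = m1_sum / P1
        mu2 = (mean_total - m1_sum) / P2
        var_b = P1 * P2 * (mu1 - mu2) ** 2  # between-class variance (5.3)
        if var_b > best_var + 1e-12:
            best_var, best_ks = var_b, [k]
        elif abs(var_b - best_var) <= 1e-12:
            best_ks.append(k)
    if not best_ks:
        raise ValueError("histogram occupies a single gray level")
    # Step 3: average the maximizers when the maximum is not unique.
    return sum(best_ks) / len(best_ks)
```

For a histogram with two well-separated modes, every threshold between the modes gives the same between-class variance, so the function returns the midpoint of that range.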


In some applications, the histogram has K > 2 significant modes (peaks). Then, 
we need K − 1 thresholds to group pixels into K segments. The extension of the Otsu 
method to multi-level thresholding is referred to as the multi-Otsu method [Lia 01]. 
Of course, reliable determination of thresholds becomes more difficult as the num- 
ber of modes increases. 


5.1.2 Clustering 


In image segmentation by clustering, it is expected that feature vectors from 
similar-appearing image regions will form clusters in the feature space. If we consider 
segmentation of an image into K classes, then the segmentation label field, $z(x_1, x_2)$, 
assumes one of the K values at each pixel, i.e., $z(x_1, x_2) = l$, $l = 1, \ldots, K$. In the 
case of scalar features, such as pixel intensities, clustering can be considered as a 
method of determining the K − 1 thresholds that define the decision boundaries in 
the 1D feature space. With M-dimensional vector features, segmentation corresponds 
to partitioning the M-dimensional feature space into K disjoint regions. 


K-Means Algorithm 


A standard procedure for clustering is to assign each sample to the class of the nearest 
cluster mean [Col 79, Lim 90]. In the unsupervised mode, where the cluster means 
are unknown, this can be achieved by an iterative procedure, known as the K-means 
algorithm. In the following, we describe the K-means algorithm assuming that we 
wish to segment an image into K regions based on the gray values of the pixels. Let 
$x = (x_1, x_2)$ denote the coordinates of a pixel and $s(x)$ its gray level. The K-means 
method aims to minimize the performance index: 


$$J = \sum_{l=1}^{K} \sum_{x \in A_l^{(i)}} \| s(x) - \mu_l \|^2 \tag{5.4}$$


where $A_l^{(i)}$ denotes the set of samples assigned to cluster $l$ after the $i$th iteration, 
and $\mu_l$ denotes the mean of the $l$th cluster. The index J measures the sum of the 
squared distances of the samples from their respective cluster means. The K-means 
algorithm usually converges to a local minimum of the index J; hence, different 
initializations may result in different segmentation results. The K-means algorithm 
can be summarized as follows: 


1. Choose K initial cluster means, $\mu_1^{(1)}, \mu_2^{(1)}, \ldots, \mu_K^{(1)}$, arbitrarily. 
2. At the $i$th iteration, assign each pixel, x, to one of the K clusters according to the 
relation 


$$x \in A_j^{(i)} \quad \text{if} \quad \| s(x) - \mu_j^{(i)} \| < \| s(x) - \mu_l^{(i)} \| \quad \text{for all } l = 1, 2, \ldots, K,\ l \neq j$$


where $A_j^{(i)}$ denotes the set of samples whose cluster center is $\mu_j^{(i)}$. That is, assign 
each sample to the class of the nearest cluster mean. 
3. Update the cluster means $\mu_j^{(i+1)}$, $j = 1, 2, \ldots, K$, as the sample mean of all 
samples in $A_j^{(i)}$: 


$$\mu_j^{(i+1)} = \frac{1}{N_j} \sum_{x \in A_j^{(i)}} s(x), \qquad j = 1, 2, \ldots, K$$


where $N_j$ is the number of samples in $A_j^{(i)}$. 
4. If $\mu_j^{(i+1)} = \mu_j^{(i)}$ for all $j = 1, 2, \ldots, K$, the algorithm has converged, and the 
procedure is terminated. Otherwise, go to step 2. 


Example. Let K = 2. The procedure starts with specifying two cluster 
means, $\mu_1$ and $\mu_2$, arbitrarily. Then, all feature points that are closer to $\mu_1$ 
are labeled as “z=1” and those closer to $\mu_2$ are labeled as “z=2.” Next, the 
average of all feature points that are labeled as “1” gives the new value of $\mu_1$, 
and the average of those labeled as “2” gives the new value of $\mu_2$. The 
procedure is repeated until convergence. 
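
The four steps of the K-means algorithm for scalar gray levels can be sketched as below. The function name `kmeans_gray`, the iteration cap, the 0-based labels (the text uses 1, ..., K), and the rule that an empty cluster keeps its previous mean are assumptions of this sketch.

```python
def kmeans_gray(values, means, max_iter=100):
    """Cluster scalar gray levels; `means` holds the K initial cluster means.

    Returns (labels, means), with labels in 0..K-1.
    """
    means = list(means)
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Step 2: assign each sample to the class of the nearest cluster mean.
        labels = [min(range(len(means)), key=lambda l: abs(s - means[l]))
                  for s in values]
        # Step 3: update each mean as the sample mean of its cluster.
        new_means = []
        for l in range(len(means)):
            members = [s for s, z in zip(values, labels) if z == l]
            # An empty cluster keeps its previous mean (assumption of this sketch).
            new_means.append(sum(members) / len(members) if members else means[l])
        # Step 4: stop when the means no longer change.
        if new_means == means:
            break
        means = new_means
    return labels, means
```

Because the result depends on the initial means, it is common to run such a routine from several initializations and keep the result with the smallest index J.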


The biggest challenge in K-means clustering is the determination of the correct 
number of classes, which is assumed known. In practice, the value of K is determined 
by trial and error. Different values of K can be tried until a desired clustering qual- 
ity is achieved. To this effect, measures of clustering quality have been developed, 
including the within-cluster and between-cluster scatter measures [Col 79, Fig 02]. 
Although we have presented the K-means algorithm here for the case of scalar pixel 
features, it can be straightforwardly extended to the case of vector pixel features and/ 
or region-based image features. The procedure provides cluster means, which can be 
used for other applications, such as vector quantization. 


Mean-Shift 


Mean-shift (MS) is a mode-finding algorithm [Che 95]. Unlike K-means, the MS 
method does not assume prior knowledge of the number of clusters, which makes it 
well suited to unsupervised clustering. The main idea is as follows. For each data 
point $x_j$: 


1. Take a window with the bandwidth parameter $h$, containing $n$ samples, centered 
around the data point $x_j$. 

2. Compute the weighted mean (center of mass or centroid) of the data within the 
window, using a kernel or weighting function $g(\cdot)$: 


$$m(x_j) = \frac{\sum_{i=1}^{n} x_i\, g\!\left( \left\| \frac{x_i - x_j}{h} \right\|^2 \right)}{\sum_{i=1}^{n} g\!\left( \left\| \frac{x_i - x_j}{h} \right\|^2 \right)}$$


3. Shift the center of the window to the new mean and repeat the procedure until 
convergence. 
The MS is the difference $m(x_j) - x_j$ between the weighted mean and the current 
center of the kernel. The stationary points of the algorithm are modes of the density 
function. All points associated with the same mode belong to the same cluster. 
Typically, MS is run at each data point or sometimes at points that are selected 
uniformly from the feature space [Com 02]. 

Combining the three steps above and assuming a Gaussian weighting, i.e., 
$g(x) = e^{-x}$, the mode $y_j$ for each data point $x_j$ can be computed via the following 
iterations: 


$$y_j^{(k+1)} = \frac{\sum_{i=1}^{n} x_i \exp\!\left( -\left\| \frac{y_j^{(k)} - x_i}{h} \right\|^2 \right)}{\sum_{i=1}^{n} \exp\!\left( -\left\| \frac{y_j^{(k)} - x_i}{h} \right\|^2 \right)}, \qquad k = 1, 2, \ldots \tag{5.5}$$


where $y_j^{(1)} = x_j$ is the initial center of the window (centered at the data point). It 
has been shown that the MS algorithm is an adaptive gradient ascent method, which is 
guaranteed to converge to a point (mode) where the distribution has zero gradient 
[Com 02]. Thus, MS steps are large in regions where data points are sparsely 
populated, and the steps are smaller near modes. 
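
The mean-shift iterations with Gaussian weighting can be sketched for one-dimensional data as follows. The function names, the convergence tolerance, and the `merge_tol` rule used to decide that two converged points are the same mode are assumptions of this sketch.

```python
import math


def mean_shift_mode(start, data, h, tol=1e-6, max_iter=500):
    """Iterate the Gaussian-weighted mean from `start`, with g(x) = exp(-x)."""
    y = start
    for _ in range(max_iter):
        w = [math.exp(-((y - x) / h) ** 2) for x in data]
        y_new = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
        if abs(y_new - y) < tol:   # the mean-shift vector vanishes at a mode
            return y_new
        y = y_new
    return y


def mean_shift_cluster(data, h, merge_tol=1e-2):
    """Label 1D samples by the mode to which each of them converges."""
    modes, labels = [], []
    for x in data:
        m = mean_shift_mode(x, data, h)
        for idx, known in enumerate(modes):
            if abs(m - known) < merge_tol:   # same mode, same cluster
                labels.append(idx)
                break
        else:
            modes.append(m)                  # a new mode was found
            labels.append(len(modes) - 1)
    return labels, modes
```

Note that the number of clusters is not an input: it is the number of distinct modes found, which depends on the bandwidth h.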

Although MS is a non-parametric algorithm, it requires the bandwidth parameter 
$h$ to be estimated. The choice of bandwidth influences the convergence rate and 
the number of clusters. A small $h$ can slow down convergence and may result in too 
many clusters. A large $h$ can speed up convergence but might merge some modes. A 
popular solution is to use adaptive MS, where the bandwidth (size) parameter $h$ varies 
for each data point and is calculated using the k-nearest neighbor method. If $x_{i,k}$ is 
the $k$th-nearest neighbor of $x_i$, then the bandwidth is calculated as 


$$h_i = \| x_i - x_{i,k} \| \tag{5.6}$$


A comparison of MS with K-means shows that K-means is sensitive to initializa- 
tion. A wrong initialization can delay convergence or even result in wrong clusters, 
whereas MS is fairly robust to initialization. Likewise, K-means is sensitive to outli- 
ers, while MS is not. However, in higher dimensional clustering problems, the num- 
ber of local maxima may be large and MS might not work well. 

Clustering methods do not impose spatial-continuity constraints on the esti- 
mated segmentation labels to ensure spatial connectivity of the segments. The 
Bayesian segmentation, which is treated next, can be considered as clustering with 
statistical spatial-connectivity constraints. 
5.1.3 Bayesian Methods 


Similar to Bayesian motion estimation (Chapter 4), Bayesian segmentation methods 
model uncertainty in the data and prior knowledge about the desired segmentation 
(e.g., spatially connected segments) in terms of probability density functions (pdf), 
and are usually formulated as an energy minimization problem that can be solved by 
nonlinear optimization. This section presents the maximum a posteriori probability 
(MAP) approach, which belongs to the class of Bayesian methods. The difficulty 
with Bayesian methods is in specifying realistic probabilistic models and solution of 
the resulting nonlinear optimization problem. Numerical optimization procedures 
that can reach the global optimum are often time consuming (see Appendix C) and 
faster greedy methods can be trapped in a local minimum. 


Maximum A Posteriori Probability Method 


The MAP formulation takes the presence of observation noise into account explicitly 
by modeling the observed image data as $g(x) = s(x) + v(x)$, where $v(x)$ denotes the 
observation noise. In vector notation, g denotes an N-dimensional vector obtained 
by lexicographical ordering of the monochrome image data. We wish to estimate 
a segmentation label field, denoted by the N-dimensional vector z. A label 
$z(x) = l$, $l = 1, 2, \ldots, K$, implies that the pixel (site) x belongs to the $l$th class among 
K classes. The desired estimate of the segmentation field, $\hat{z}$, is defined as the one 
that maximizes the a posteriori pdf $p(z \mid g)$ of the segmentation label field, given the 
observed image g. Using the Bayes rule, 


$$p(z \mid g) \propto p(g \mid z)\, p(z) \tag{5.7}$$


where p(g|z) denotes the class-conditional pdf. The class-conditional pdf of data, 
given the segmentation labels z, relates the segmentation labels to the data. The term 
p(z) is the a priori pdf, which expresses prior expectations about the segmentation, 
e.g., to impose a spatial-connectivity constraint on the segmentation labels. Thus, 
estimation of the segmentation labels is not only dependent on the image intensity, 
but also constrained by the expected spatial properties imposed by the a priori pdf 
model. In the following, we first introduce the a priori pdf model, then proceed to 
examine the assumptions used in characterizing the conditional pdf. 


A Priori Probability Model 

Derin and Elliott [Der 87] successfully used a Gibbs random field (GRF) as the a priori 
probability model for segmentation labels. In order to eliminate isolated regions that 
may arise in the segmentation label field, the GRF model can be designed to assign 
a higher probability to segmentations that have contiguous, connected regions. We 
review the GRF representation and how to model smoothness of labels using clique 
potentials in Appendix B. It follows that the a priori pdf of the label field can be 
expressed by an impulse train with Gibbsian weighting: 


$$p(z) = \frac{1}{Q} \sum_{\omega \in \Omega} e^{-U(\omega)/T}\, \delta(z - \omega) \tag{5.8}$$


where $\Omega$ is the finite sample space of the random vector z, $\delta(\cdot)$ denotes a Dirac delta 
function, T is the temperature parameter, the normalizing constant 


$$Q = \sum_{\omega \in \Omega} e^{-U(\omega)/T}$$

is called the partition function, and $U(z)$ is the Gibbs potential defined by 

$$U(z) = \sum_{c \in C} V_c(z)$$


Here, C is the set of all cliques, and $V_c$ is the individual clique potential function. 
The single-pixel clique potentials, which reflect a priori probabilities of different 
labels, can be defined as 


$$V_c(z(x)) = \alpha_l \quad \text{if } z(x) = l \text{ and } x \in c, \quad \text{for } l = 1, 2, \ldots, K \tag{5.9a}$$


The smaller $\alpha_l$, the higher the likelihood of region $l$, $l = 1, 2, \ldots, K$. Spatial 
connectivity of the segmentation can be imposed by assigning two-pixel clique 
potentials: 


$$V_c(z(x_i), z(x_j)) = \begin{cases} -\beta & \text{if } z(x_i) = z(x_j) \text{ and } x_i, x_j \in c \\ \beta & \text{if } z(x_i) \neq z(x_j) \text{ and } x_i, x_j \in c \end{cases} \tag{5.9b}$$


where $\beta$ is a positive constant. The larger the value of $\beta$, the stronger the smoothness 
constraint. 


Conditional-Probability Model 


The original image, s, can be modeled by a mean-intensity function, denoted by the 
vector $\mu$, plus a zero-mean, white Gaussian residual process, r, with variance $\sigma_r^2$, i.e., 


$$s = \mu + r \tag{5.10a}$$

where all vectors are N-dimensional lexicographic orderings of the respective arrays. 
Then, 

$$g = \mu + \varepsilon \tag{5.10b}$$


where $\varepsilon = r + v$ is a zero-mean Gaussian process with variance $\sigma_\varepsilon^2 = \sigma_r^2 + \sigma_v^2$, and 
$\sigma_v^2$ denotes the variance of the observation noise, which is also taken to be a 
zero-mean, white Gaussian process. We will refer to this combined term, $\varepsilon$, simply as 
the additive noise term. The original formulation by Derin and Elliott [Der 87] models 
the mean intensity of each image region as a constant, denoted by the scalar $\mu_l$, 
$l = 1, 2, \ldots, K$, for segmenting an image into K regions. That is, the elements of $\mu$ 
attain K distinct values, $\mu_l$, $l = 1, 2, \ldots, K$. Based on the model of the observed image 
in (5.10), the conditional-probability distribution is expressed as 


$$p(g \mid z) \propto \exp\left\{ -\sum_{x} \frac{[g(x) - \mu_{z(x)}]^2}{2\sigma_\varepsilon^2} \right\} \tag{5.11}$$

where $z(x) = l$ designates the assignment of site x to region $l = 1, 2, \ldots, K$. Note 
that maximization of pdf (5.11) alone results in maximum likelihood (ML) image 
segmentation. 
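Substituting (5.8) and (5.11) into (5.7), the a posteriori density has the form 


$$p(z \mid g) \propto \exp\left\{ -\frac{1}{2\sigma_\varepsilon^2} \sum_{x} [g(x) - \mu_{z(x)}]^2 - \frac{1}{T} \sum_{c \in C} V_c(z) \right\} \tag{5.12}$$


We maximize the posterior density (5.12) to find estimates of $\mu_l$, $l = 1, 2, \ldots, K$, 
and the desired segmentation labels z. Note that maximizing (5.12) is equivalent to 
minimizing the cost 


$$E(z) = \sum_{x} \frac{[g(x) - \mu_{z(x)}]^2}{2\sigma_\varepsilon^2} + \frac{1}{T} \sum_{c \in C} V_c(z) \tag{5.13}$$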

The solution to the MAP segmentation problem can be obtained by Monte 
Carlo type methods. A well-known method to reach the global optimum is simu- 
lated annealing. However, because it is computationally complex, we often use a 
sub-optimal method, called iterated conditional mode (ICM), trading optimality for 
reduced computational complexity (see Appendix C). 

Observe that if we turn off the spatial-smoothness constraints, i.e., neglect the 
second term, the result would be identical to that of the K-means algorithm. Thus, 
we follow a procedure that is similar to that of the K-means algorithm, i.e., we start 
with an initial estimate of class means and assign each pixel to one of the K classes by 
minimizing (5.13), then we update the class means using these estimated labels, and 
iterate between these two steps until convergence. 
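
The two-step procedure can be sketched with ICM as the label-update step. This is a minimal illustration rather than the formulation of [Der 87]: it assumes T = 1, a 4-pixel neighborhood with only the two-pixel clique potentials of (5.9b), no single-pixel potentials, a known noise variance `sigma2`, and 0-based labels. The function name and the iteration limits are also hypothetical.

```python
def icm_segment(g, means, sigma2, beta, n_outer=10, n_icm=5):
    """Approximate MAP segmentation of a 2D list of gray levels `g` by ICM."""
    rows, cols = len(g), len(g[0])
    K = len(means)
    means = list(means)
    # Initial labels: nearest class mean (the K-means assignment step).
    z = [[min(range(K), key=lambda l: (g[r][c] - means[l]) ** 2)
          for c in range(cols)] for r in range(rows)]

    def local_cost(r, c, l):
        # Data term of the MAP cost for label l at site (r, c).
        data = (g[r][c] - means[l]) ** 2 / (2.0 * sigma2)
        # Two-pixel clique potentials over the 4-neighborhood, as in (5.9b).
        prior = 0.0
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                prior += -beta if z[rr][cc] == l else beta
        return data + prior

    for _ in range(n_outer):
        # Step 1: update the labels site by site with the means held fixed.
        for _ in range(n_icm):
            changed = False
            for r in range(rows):
                for c in range(cols):
                    best = min(range(K), key=lambda l: local_cost(r, c, l))
                    if best != z[r][c]:
                        z[r][c], changed = best, True
            if not changed:
                break
        # Step 2: update the class means from the current labels.
        new_means = list(means)
        for l in range(K):
            vals = [g[r][c] for r in range(rows) for c in range(cols)
                    if z[r][c] == l]
            if vals:
                new_means[l] = sum(vals) / len(vals)
        if new_means == means:
            break
        means = new_means
    return z, means
```

Setting `beta=0` removes the smoothness term, in which case each pixel is simply assigned to the nearest class mean, as in K-means.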


Adaptive MAP Method 


Pappas [Pap 92] has proposed an image model with slowly varying means $\mu_{z(x)}(x)$, as 
opposed to a constant mean $\mu_{z(x)}$ as used by Derin and Elliott [Der 87], for modeling 
intensities within each image region denoted by $z(x)$. An adaptive clustering 
procedure has been proposed based on this model, where the cluster means $\mu_l$ vary 
slowly by pixel location x. The MAP method can be made adaptive similarly by letting 
the mean of the class-conditional pdfs vary slowly. Then, the modified class-conditional 
pdf model becomes 


$$p(g \mid z) \propto \exp\left\{ -\sum_{x} \frac{[g(x) - \mu_{z(x)}(x)]^2}{2\sigma_\varepsilon^2} \right\} \tag{5.14a}$$


where $\mu_{z(x)}(x)$ denotes the space-variant mean intensity for each class $z(x)$ and the 
a posteriori pdf is given by 


$$p(z \mid g) \propto \exp\left\{ -\frac{1}{2\sigma_\varepsilon^2} \sum_{x} [g(x) - \mu_{z(x)}(x)]^2 - \frac{1}{T} \sum_{c \in C} V_c(z) \right\} \tag{5.14b}$$


Again, maximizing (5.14b) is equivalent to minimizing 


$$E(z) = \sum_{x} \frac{[g(x) - \mu_{z(x)}(x)]^2}{2\sigma_\varepsilon^2} + \frac{1}{T} \sum_{c \in C} V_c(z) \tag{5.15a}$$

The adaptive algorithm follows a procedure that is similar to the two-step iterations 
described in the non-adaptive case, except that the cluster means $\mu_l(x)$ at site x 
for each region $l$ are estimated as the sample mean of those pixels with label $l$ within 
a local window about the pixel x (as opposed to over the entire image). The following 
simplifications are usually performed to reduce the computational burden: 


1. The space-varying mean estimates may be computed on a sparse grid and then 
interpolated. 

2. The optimization is performed using the ICM method (see Appendix C). Note 
that the ICM is equivalent to maximizing the local a posteriori pdf at a site $x_i$, 
given by 

$$p(z(x_i) \mid g(x_i), z(x_j), \text{all } x_j \in N_{x_i}) \propto \exp\left\{ -\frac{1}{2\sigma_\varepsilon^2} [g(x_i) - \mu_{z(x_i)}(x_i)]^2 - \frac{1}{T} \sum_{c \in C} V_c(z) \right\} \tag{5.15b}$$

where $N_{x_i}$ denotes the neighborhood of the site $x_i$. 

Allowing space-varying class-means offers a couple of advantages: i) it avoids 
oversegmentation since fewer regions are required to segment an image into percep- 
tually meaningful regions; ii) it avoids merging perceptually distinct regions with 
low local intensity contrast. 


Vector-Field Segmentation 


In most applications we deal with the segmentation of multi-channel data such 
as color images or motion-vector fields. The Bayesian segmentation algorithms 
(MAP and adaptive MAP) can be generalized to segment multi-channel data. This 
extension involves modeling the multi-channel data with a vector random field, 
where the components of the vector field (each individual channel) are assumed 
to be conditionally independent given the segmentation labels of the pixels. Note 
that we have a scalar segmentation label field, which means each vector is assigned 
a single label, as opposed to segmenting the channels individually. 

The class-conditional probability model for the vector image field is taken as a 
multi-variate Gaussian distribution with a space-varying mean function. We assume 
that M channels of multi-spectral data are available and denote them by a 
P-dimensional ($P = N \cdot M$) vector $[g_1\; g_2\; \cdots\; g_M]^T$, where $g_j$ corresponds to the 
$j$th channel. A single segmentation field, z, which is consistent with all M channels of 
data and is in agreement with the prior knowledge, is desired. By assuming the 
conditional independence of the channels given the segmentation field, the conditional 
probability in (5.11) becomes 


$$p(g_1, g_2, \ldots, g_M \mid z) = p(g_1 \mid z)\, p(g_2 \mid z) \cdots p(g_M \mid z) \tag{5.16}$$


The extension of the Bayesian methods for multi-channel data following the 
model (5.16) is straightforward. 


5.1.4 Graph-Based Methods 


Graph-based methods construct a graph in which the nodes represent pixels or 
blocks of pixels or over-segmented regions in the image and edges represent affinities 
(couplings) between them. The image is segmented by cutting the graph into sub- 
graphs such that the cost, which is the sum of affinities across the cut, is minimized. 
Normalizing the cost of a cut by the area of segments [Tao 07] or by a measure 
derived from affinities between nodes within the segments [Shi 00] avoids favoring 
small regions and prevents over-segmentation of the image. 

Let G = (V, E, W) denote a graph, where V is the set of nodes, E is the set of edges 
(links) connecting the nodes, and W is an edge-affinity matrix. A pair of nodes i and j 
is connected by an edge with a weight $w(i, j) = w(j, i) > 0$, i.e., a measure of affinity 
(similarity) between them. The graph can be partitioned into two disjoint sets, A 
and $B = V - A$, by removing the edges connecting the two parts. The degree of 
dissimilarity between the two sets can be computed as the total weight of the removed 
edges, which is called a cut in graph theory: 


$$\mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w(i, j) \tag{5.17}$$


There are many algorithms that solve the minimum-cut problem in polynomial 
time with small constants [Kol 04]. However, the minimum cut criterion favors 
grouping small sets of isolated nodes in the graph, because it does not contain any 
intragroup information, and as a result causes over-segmentation. The normalized 
cut (Ncut) criterion, given by 


$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)} \tag{5.18}$$


where assoc(A, V) and assoc(B, V) denote the total connections from nodes in A and 
in B, respectively, to all nodes in the graph, is usually preferred since it favors 
equal-size regions. 
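
The bias of the minimum cut toward isolated nodes, and how the normalization removes it, can be checked on a toy graph. The helper names and the 7-node example graph below (two triangles joined by a weak edge, plus one weakly attached node) are illustrative assumptions.

```python
def cut(W, A, B):
    """Total weight of the edges between node sets A and B, as in (5.17)."""
    return sum(W[i][j] for i in A for j in B)


def assoc(W, A):
    """Total connection weight from the nodes in A to all nodes in the graph."""
    return sum(W[i][j] for i in A for j in range(len(W)))


def ncut(W, A):
    """Normalized cut (5.18) of the bipartition (A, V - A)."""
    B = [i for i in range(len(W)) if i not in A]
    c = cut(W, A, B)
    return c / assoc(W, A) + c / assoc(W, B)


# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by a weak edge (2, 3),
# plus node 6 attached to node 5 by an even weaker edge.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.3), (5, 6, 0.2)]
W = [[0.0] * 7 for _ in range(7)]
for i, j, w in edges:
    W[i][j] = W[j][i] = w
```

On this graph the plain cut is smallest when node 6 is split off on its own, whereas the normalized cut is smallest for the balanced split between the two triangles.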

In the case of the normalized cut criterion, polynomial methods, with runtime 
complexity O(N’logN), where N denotes the number of nodes, exist for finding 
a globally optimal solution when the graph is planar [Shi 00]. When the graph 
is non-planar finding a globally optimal solution is NP-hard, and approximation 
methods are employed. The most popular solution is perhaps the one proposed by 
Shi and Malik [Shi 00]. Let $d_i = \sum_j w(i, j)$ denote the total affinity from node i to all 
other nodes, D be an N×N diagonal matrix with diagonal entries $d_i$, W be an N×N 
matrix with entries $w(i, j)$, and x be an N×1 indicator vector, such that $x_i = 1$ if node 
$i \in A$ and $x_i = -1$ otherwise. Shi and Malik [Shi 00] have shown that 


$$\min_{x} \mathrm{Ncut}(A, B) = \min_{y} \frac{y^T (D - W)\, y}{y^T D\, y} \tag{5.19}$$

with the condition $y_i \in \{1, -b\}$, where $y = (1 + x) - b(1 - x)$ and 
$b = \sum_{x_i > 0} d_i \big/ \sum_{x_i < 0} d_i$. If y is relaxed to take on real values, then the 
minimization can be achieved by solving the generalized eigenvalue problem 


$$(D - W)\, y = \lambda D\, y \tag{5.20}$$


where $(D - W)$ is called the Laplacian matrix. The solution of the normalized cut 
problem is given by the eigenvector corresponding to the second smallest eigenvalue 
of this eigenvalue problem. Since the solution y is real-valued, a threshold must be 
selected to estimate the indicator vector x. A common procedure is to conduct a 
search over $l$ evenly spaced possible split thresholds to obtain the minimum 
Ncut(A, B). 
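
The relaxed problem (5.20) and the threshold search can be sketched without a linear-algebra library by running a power iteration on the symmetrically normalized affinity matrix. The power iteration, the deterministic starting vector, and all function and parameter names are choices of this sketch, not the solver used in [Shi 00]. It assumes a connected graph in which every node has positive degree.

```python
import math


def ncut_value(W, d, A):
    """Ncut (5.18) of the bipartition (A, V - A); d holds the node degrees."""
    in_A = set(A)
    c = sum(W[i][j] for i in in_A for j in range(len(W)) if j not in in_A)
    assoc_A = sum(d[i] for i in in_A)
    return c / assoc_A + c / (sum(d) - assoc_A)


def ncut_bipartition(W, n_splits=10, n_iter=2000):
    """Approximate the minimum-Ncut bipartition via the relaxation (5.20)."""
    n = len(W)
    d = [sum(row) for row in W]                     # d_i = sum_j w(i, j)
    s = [1.0 / math.sqrt(di) for di in d]           # diagonal of D^{-1/2}
    # (D - W) y = lambda D y  <=>  M u = (1 - lambda) u, where
    # M = D^{-1/2} W D^{-1/2} and y = D^{-1/2} u.
    M = [[s[i] * W[i][j] * s[j] for j in range(n)] for i in range(n)]
    # Trivial eigenvector of M (lambda = 0): u0 = D^{1/2} 1, normalized.
    u0 = [math.sqrt(di) for di in d]
    norm0 = math.sqrt(sum(v * v for v in u0))
    u0 = [v / norm0 for v in u0]

    def deflate_and_normalize(u):
        dot = sum(a * b for a, b in zip(u, u0))
        u = [a - dot * b for a, b in zip(u, u0)]
        norm = math.sqrt(sum(a * a for a in u))
        return [a / norm for a in u]

    # Power iteration on (M + I)/2, whose eigenvalues lie in [0, 1]; with u0
    # removed it converges to the eigenvector of the second smallest lambda.
    u = deflate_and_normalize([float(i + 1) for i in range(n)])
    for _ in range(n_iter):
        Mu = [sum(M[i][j] * u[j] for j in range(n)) for i in range(n)]
        u = deflate_and_normalize([0.5 * (a + b) for a, b in zip(Mu, u)])
    y = [s[i] * u[i] for i in range(n)]

    # Search evenly spaced split thresholds for the smallest Ncut.
    lo, hi = min(y), max(y)
    best_val, best_A = None, None
    for t in range(1, n_splits + 1):
        thr = lo + (hi - lo) * t / (n_splits + 1)
        A = [i for i in range(n) if y[i] > thr]
        val = ncut_value(W, d, A)
        if best_val is None or val < best_val:
            best_val, best_A = val, A
    return best_A, best_val


# Toy graph: two triangles joined by a weak edge, plus a weakly attached node.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.3), (5, 6, 0.2)]
W = [[0.0] * 7 for _ in range(7)]
for i, j, w in edges:
    W[i][j] = W[j][i] = w
A, value = ncut_bipartition(W)
```

The sign of an eigenvector is arbitrary, so the returned set A may be either side of the partition. For image-sized graphs a sparse eigensolver would be used instead of a dense power iteration.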

In the Shi and Malik method [Shi 00], the nodes of the graph are individual pixels. 
Felzenszwalb et al. [Fel 04] proposed an efficient implementation that runs in nearly 
linear time with the number of pixels in the image for superpixel formation. An alter- 
native approach is to pre-segment the image into uniform color regions using, for 
example, the MS algorithm and then use the normalized cut approach with nodes of 
the graph taken as the initially oversegmented regions [Tao 07]. Graph-based image 
segmentation methods have also been extended to hierarchical image segmentation 
[Cou 05]. 


5.1.5 Active-Contour Models 


Active-contour models (also called snakes) are parametric planar curves that snap 
to object/segmentation boundaries by minimizing an energy functional. The snake 
was first introduced by Kass et al. [Kas 88], who treated energy minimization as a 
variational calculus problem. The variational approach may suffer from the need for 
estimates of high-order derivatives from discrete data, unpredictable convergence of 
the iterative process, and inability to enforce hard constraints. Instead, Amini et al. 
[Ami 90] used discrete dynamic programming for optimization, which is numeri- 
cally more stable and allows inclusion of hard constraints. 

The user specifies N node points, $x_i = (x_{i1}, x_{i2})$, $i = 1, \ldots, N$, on the boundary of 
the desired object. The initial object contour is the union of contour segments that 
are obtained by joining these nodes with straight lines. Mathematically, the initial 
contour segments are given by 


$$C(s) = (i - s)\, x_{i-1} + (s - i + 1)\, x_i, \qquad i - 1 \le s \le i \tag{5.21}$$
The initial contour is then snapped onto the desired object boundary by minimizing 
the total energy functional, which can be expressed as the weighted sum of three 
energy terms 

$$E_{snake} = \alpha_{int} E_{int} + \alpha_{image} E_{image} + \alpha_{con} E_{con} \tag{5.22}$$

where $E_{int}$ represents internal forces that promote regularity (smoothness) of the 
curve, $E_{image}$ is the image force that pushes the contour towards high-gradient edges, 
and $E_{con}$ helps incorporate prior knowledge about the desired shape of the object, 
favoring a particular shape. 

We express the total energy as a sum of energy terms $E_i$ over local contour 
segments, 


$$E_{snake} = \sum_{i=1}^{N} E_i, \qquad E_i = \alpha_{int} E_{int,i} + \alpha_{image} E_{image,i} + \alpha_{con} E_{con,i} \tag{5.23}$$

where each contour segment links up to three nodes $x_i$, $x_{i-1}$, and $x_{i-2}$. For closed 
contours, we assume that node 1 is linked to node N; hence, there are N contour 
segments for N nodes. 

Following Amini et al. [Ami 90], the minimum of (5.23) is searched by a discrete 
multi-stage decision process (dynamic programming), where at each stage n we 
minimize 


$$E^{(n)} = \sum_{i=1}^{n} E_i, \qquad n = 1, \ldots, N \tag{5.24}$$


which can be performed recursively as 

$$E^{(n)}(k, l) = \min_{j} \left\{ E^{(n-1)}(j, k) + E_n(j, k, l) \right\} \tag{5.25}$$


where j, k, and l refer to the indices of all possible search points (perturbations) 
around nodes $x_{n-2}$, $x_{n-1}$, and $x_n$, respectively. The minimum snake energy is found 
at the final stage, where $E_{snake} = E^{(N)}$. Then, the optimal contour can be computed 
by backtracing the search points at each node that yield the minimum snake energy. 
In this scheme, hard constraints can be easily imposed by discarding connections 
that violate the constraints. The number of nodes along the contour can be increased 
or decreased as desired during energy optimization. If the distance between two 
adjacent nodes after snapping is less than a lower bound, one of them will be deleted; 
or if the distance exceeds an upper bound, new nodes can be added between them. 
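
The recursion and the backtracing step can be sketched for an open contour as follows. This is a simplified illustration, not the full method of [Ami 90]: the internal energy is reduced to a squared second difference of three consecutive nodes, the shape-prior term is omitted, node insertion and deletion are not handled, and `image_energy` is a hypothetical user-supplied function that is low on the desired boundary.

```python
def snake_dp(candidates, image_energy, alpha_int=1.0, alpha_image=1.0):
    """Pick one search point per node of an open contour by dynamic programming.

    candidates[n] lists the (x1, x2) search points around node n (N >= 3 nodes).
    Returns (indices of the chosen points, minimum total energy).
    """
    N = len(candidates)

    def local_energy(a, b, c):
        # Internal energy: squared second difference of three consecutive nodes.
        curv = (a[0] - 2 * b[0] + c[0]) ** 2 + (a[1] - 2 * b[1] + c[1]) ** 2
        return alpha_int * curv + alpha_image * image_energy(c)

    # Table for the first two nodes: only the image energy is available.
    cost = {(k, l): alpha_image * (image_energy(p) + image_energy(q))
            for k, p in enumerate(candidates[0])
            for l, q in enumerate(candidates[1])}
    back = []  # back[m][(k, l)] = best index j at stage n = m + 2
    for n in range(2, N):
        new_cost, choice = {}, {}
        for k, b in enumerate(candidates[n - 1]):
            for l, c in enumerate(candidates[n]):
                # E^(n)(k, l) = min_j { E^(n-1)(j, k) + E_n(j, k, l) }  (5.25)
                best_j = min(range(len(candidates[n - 2])),
                             key=lambda j: cost[(j, k)]
                             + local_energy(candidates[n - 2][j], b, c))
                new_cost[(k, l)] = (cost[(best_j, k)]
                                    + local_energy(candidates[n - 2][best_j], b, c))
                choice[(k, l)] = best_j
        cost = new_cost
        back.append(choice)

    # Minimum snake energy at the final stage, followed by backtracing.
    (k, l), e_min = min(cost.items(), key=lambda kv: kv[1])
    path = [k, l]
    for choice in reversed(back):
        path.insert(0, choice[(path[0], path[1])])
    return path, e_min
```

A hard constraint could be imposed in this sketch by skipping any index j whose connection violates the constraint inside the minimization over j.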
Alternative active-contour formulations based on minimization of the Mumford-Shah 
functional have been proposed [Cha 01, Tsa 01]. These formulations employ 
an active contour to model the set of discontinuities in the Mumford-Shah 
functional and, hence, use a global-segmentation model for stopping curve evolution 
as opposed to using local image gradients. Chan and Vese [Cha 01] formulate the 
model in terms of level-set functions and solve the associated Euler-Lagrange 
equations iteratively. 


5.2 Change Detection 


Change detection is employed in digital-video processing in many different contexts 
including temporal segmentation of video into shots, moving object detection and 
tracking, motion-compensated video filtering (deinterlacing, denoising), and mode 
selection for motion-compensated video compression (skip mode). Various change- 
detection methods differ according to: i) what features and scene/background mod- 
els are used, ii) what distance metrics are used, and iii) what kind of threshold and 
scene/background-model adaptation rules are used [Rad 05]. Change-detection 
methods can be classified as: i) shot-boundary detection for detecting abrupt or grad- 
ual transitions between scenes, and ii) frame/background subtraction for motion or 
foreground object detection, which are treated in Section 5.2.1 and Section 5.2.2, 
respectively. 


5.2.1 Shot-Boundary Detection 


Scene-change or shot-boundary detection is a temporal segmentation. Temporal 
discontinuities may be abrupt (cuts) or gradual (special effects, such as wipes and 
fades). It is easier to detect cuts than special effects. Shot-boundary detection meth- 
ods locate global temporal discontinuities, i.e., frames across which large differences 
are observed in some feature space, usually a combination of color and motion [Jia 
98, Gar 00, Kop 01, Lie 01, Han 02]. 


Pixel-Difference Methods 


The simplest approach for detecting temporal discontinuities is to quantify frame 
differences in the pixel-intensity domain. If a pre-determined number of pixels 
exhibit differences larger than a threshold value, then a “cut” can be declared. Clearly, 
this method is sensitive to the presence of camera motion, noise, and compression 
artifacts in the video. A more robust approach may be to divide each frame into 
rectangular blocks, compute statistics of each block such as the mean and variance 
independently, and then check the count of blocks with changing statistics against
a set threshold. Applying low-pass filtering to each frame prior to computing frame 
differences or block statistics can improve robustness. 
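The block-based test described above can be sketched as follows. This is only an illustration under stated assumptions: frames are lists of rows of gray levels, only the block mean is compared (the variance could be handled the same way), and the function names, block size, and thresholds are hypothetical.

```python
def block_means(frame, block_size):
    """Return the mean intensity of each block, scanning blocks row by row."""
    height, width = len(frame), len(frame[0])
    means = []
    for r in range(0, height, block_size):
        for c in range(0, width, block_size):
            block = [frame[i][j]
                     for i in range(r, min(r + block_size, height))
                     for j in range(c, min(c + block_size, width))]
            means.append(sum(block) / len(block))
    return means


def is_cut(prev_frame, cur_frame, block_size=2,
           mean_threshold=20.0, min_changed_blocks=3):
    """Declare a cut when enough blocks change their mean intensity."""
    prev_means = block_means(prev_frame, block_size)
    cur_means = block_means(cur_frame, block_size)
    changed = sum(abs(a - b) > mean_threshold
                  for a, b in zip(prev_means, cur_means))
    return changed >= min_changed_blocks
```

Counting changed blocks rather than changed pixels makes the decision less sensitive to isolated noisy pixels, at the cost of coarser spatial resolution.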


Histogram-Based Methods 


Histogram differences are generally more robust than pixel-wise or block-wise
intensity differences. We compute an n-bin color histogram, $h_k(i)$, $i = 1, \ldots, n$, for each
frame k. Various measures and tests have been developed to quantify the similarity or
dissimilarity of histograms. These include the histogram-intersection measure, the
chi-square test, and the Kolmogorov-Smirnov test [Kop 01]. A closely related approach is
to detect changes in the counts of edge pixels in successive frames, i.e., similarity of
edge histograms. Although they are effective at detecting cuts and fades, neither
histogram differences nor intensity differences can usually differentiate between wipes
and camera motion, such as pans and zooms. Detection of these special effects requires
a combination of histogram difference and camera-motion estimation. Global
motion can be estimated and the frames motion compensated before the features
are computed [Bou 99]. Another approach to detecting gradual changes is the so-called
twin-comparison method [Zha 93], which can be used with different features. A
higher threshold is used to detect abrupt scene changes, while a lower threshold
flags the potential start of a gradual transition, which is confirmed once the
accumulated difference exceeds the higher threshold.
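Two of the histogram measures named above can be sketched as follows. The sketch assumes gray-level frames stored as lists of rows (a color histogram would bin each channel or a quantized color index instead); the function names and the bin count are hypothetical.

```python
def gray_histogram(frame, n_bins=8, levels=256):
    """Count pixels per bin for a frame given as rows of integer gray levels."""
    hist = [0] * n_bins
    for row in frame:
        for value in row:
            hist[value * n_bins // levels] += 1
    return hist


def intersection_dissimilarity(h1, h2):
    """One minus the normalized histogram intersection (0 means identical)."""
    overlap = sum(min(a, b) for a, b in zip(h1, h2))
    return 1.0 - overlap / sum(h1)


def chi_square(h1, h2):
    """Chi-square distance between two histograms, skipping empty bins."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)
```

Because a histogram discards pixel positions, two frames that contain the same gray levels in different places compare as identical, which is why these measures tolerate motion better than pixel differences.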

There also exist shot-boundary detection algorithms for specific domains, such 
as surveillance video [Str 00], sports video [Eki 03], and movies [Ham 95, Sun 
02]. Sports video is arguably one of the most challenging domains for robust
shot-boundary detection due to: i) the strong color correlation between
successive shots, since a single dominant-color background, such as the soccer field,
may be seen in successive shots; ii) the large camera and object motions; and
iii) the many gradual transitions, such as wipes and dissolves. Ekin et al.
[Eki 03] observed that gradual transitions in sports video are not accurately detected
by simple algorithms using a single feature and proposed using two features for
reliable shot-boundary detection: the absolute difference, between two frames, of the
ratio of dominant-colored pixels to the total number of pixels, and the color-histogram
dissimilarity measured by histogram intersection.


Compressed-Domain Methods 


Videos are stored and transmitted in compressed form. Detection of scene changes 
in real-time may pose a challenge in some applications since decompressing and 
processing video data sequentially requires significant computational resources, 
motivating scene segmentation in the compressed domain (without complete
bitstream decoding) [Lel 03]. DC images, which are spatially reduced versions of the original
video frames, can be constructed from the DC coefficient of each 8 × 8 block in
intra-coded pictures [Yeo 95]. Successful results have been obtained for detection of 
both abrupt and gradual scene changes using only DC images [Sal 99]. 


5.2.2 Background Subtraction 


Motion-detection or background-subtraction methods segment each frame into 
changed and unchanged regions subject to different requirements in varying con- 
texts. In motion-compensated filtering and compression, motion detection needs 
to detect whether the value of the current pixel is significantly different from the 
value of the co-located pixel in the previous/reference frame/field. In the context of 
computer vision or scene analysis, background subtraction needs to segment a scene 
into meaningful foreground and background regions in order to detect and track 
moving objects. 

This section treats the background-subtraction problem, starting with a discus- 
sion of frame-differencing methods. Motion detection can be considered as a special 
case of frame differencing, where the background model is set equal to the previous 
(or a reference) frame/field. We then introduce adaptive background modeling using 
a mixture of Gaussians, which is more robust to variations in the background [Sta 
99]. Finally, we present the visual background extractor (ViBe) [Bar 11], a method that
integrates several concepts.


Frame Differencing 


The simplest method to detect changes between two properly registered frames is to 
analyze the frame difference (FD) image [Chi 02], which shows pixel-by-pixel differences
between the current frame $s_k(x)$ and a model frame $M_k(x)$, given by


$$FD_k(x) = s_k(x) - M_k(x) \quad (5.26)$$


where $x = (x_1, x_2)$ is the pixel location. Assuming a static camera and that illumination
variations are accounted for in $M_k(x)$, we can distinguish non-zero differences that
are due to noise from those that are due to genuine motion by thresholding the
FD as


$$z_k(x) = \begin{cases} 1 & \text{if } |FD_k(x)| > T \\ 0 & \text{otherwise} \end{cases}$$


where T is a fixed threshold. Here, $z_k(x)$ is called the segmentation label field, which is
equal to “1” for changed regions and “0” otherwise. The value of the threshold T can
be determined by an optimal threshold-determination algorithm or as a function 
of the variance of noise. This pixel-wise thresholding is generally followed by post- 
processing to eliminate the isolated labels. Post-processing includes forming 4- or 
8-connected regions and discarding labels with less than a predetermined number of 
connections and morphological filtering of changed and unchanged region masks. 
The model frame $M_k(x)$ is chosen according to the requirements of the problem as:


• Motion Detection by Successive Frame Differences: For simple motion/change
detection, the model frame/field is set equal to the immediate-previous or a past
reference frame/field. Although successive differences yield satisfactory results
for motion-adaptive filtering and mode selection in video compression, they often
do not produce a consistent moving-object mask for object tracking, since the
model frame contains both moving objects and the background. For example,
the uncovered background belongs to the changed region and, hence, appears 
as part of the moving objects. 

• Background Subtraction Using a Fixed-Reference Frame: For moving-object
detection and tracking, if a controlled reference frame that consists of only a
background (without any foreground object) is available, it can be used as a
fixed-model frame for all k. For example, if we are interested in monitoring a
hallway using a fixed camera, an image of the hallway when it is empty may 
be used as a fixed-reference frame. However, this choice is not robust against 
changes in scene illumination and temporal background clutter, which is often 
present in outdoor scenes. 

• Background Subtraction Using a Filtered-Model Frame: As a compromise between
these two choices, a model frame can be reconstructed by filtering, e.g., mean or
median filtering of the last N frames, where typically N = 10. The mean filter can
be implemented as a weighted running average of past frames given by [Ira 94]


$$M_k(x) = \begin{cases} (1-\alpha)\, s_k(x) + \alpha\, M_{k-1}(x) & k \geq 1 \\ s_k(x) & k = 0 \end{cases}$$


where $0 < \alpha < 1$ is the learning coefficient. After processing a few frames, the
unchanged regions in $M_k(x)$ maintain their sharpness with a reduced level of
noise, while changed regions are blurred and average out to the background 
value. The temporal integration increases the likelihood of eliminating spurious 
labels, thus resulting in spatially contiguous regions. 
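The frame-difference test and the filtered-model frame can be sketched together as follows. The sketch assumes gray-level frames stored as lists of rows and reads the running average as weighting the current frame by (1 - alpha) and the previous model by alpha; that reading, the function names, and the example threshold are assumptions.

```python
def frame_difference(frame, model):
    """Pixel-by-pixel difference between the current frame and the model frame."""
    return [[s - m for s, m in zip(frame_row, model_row)]
            for frame_row, model_row in zip(frame, model)]


def label_field(fd, threshold):
    """Label a pixel 1 (changed) when |FD| exceeds the threshold, else 0."""
    return [[1 if abs(v) > threshold else 0 for v in row] for row in fd]


def update_model(model_prev, frame, alpha):
    """Running-average model: (1 - alpha) * current frame + alpha * old model."""
    return [[(1.0 - alpha) * s + alpha * m
             for s, m in zip(frame_row, model_row)]
            for frame_row, model_row in zip(frame, model_prev)]


# Example: one pixel changes strongly, another only by noise.
model = [[10, 10], [10, 10]]
frame = [[10, 12], [90, 10]]
labels = label_field(frame_difference(frame, model), threshold=20)
model = update_model(model, frame, alpha=0.9)
```

With alpha close to 1 the model changes slowly, so a briefly visible object leaves only a faint trace in the model frame.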

In practice, a simple FD analysis is not satisfactory for background subtraction
for two reasons: First, a uniform intensity region may be interpreted as stationary 
even if it is part of a moving object (the aperture problem). It may be possible to 
avoid the aperture problem using multi-resolution analysis, since uniform intensity 
regions are smaller at lower resolution levels. Second, the intensity difference due 
to motion is affected by the magnitude of the spatial gradient in the direction of 
motion. This can be addressed by considering a locally normalized FD function [Ira 
94] or locally adaptive thresholding [Ner 98]. An improved multi-resolution frame 
difference analysis that addresses both concerns can be summarized as: 


1. Construct a Gaussian pyramid where each frame is represented in multiple 
resolutions. Start processing at the lowest resolution level. 

2. For each pixel at the present resolution level, compute the normalized frame 
difference given by [Ira 94] 


$$FDN_k(x) = \frac{\sum_{x' \in \mathcal{N}} |FD_k(x')|\, |\nabla M_k(x')|}{\sum_{x' \in \mathcal{N}} |\nabla M_k(x')|^2 + c} \quad (5.28)$$


where $\mathcal{N}$ denotes a local neighborhood of the pixel x, $\nabla M_k(x)$ denotes the gradient
of image intensity at pixel x, and c is a constant to avoid numerical instability.
If the normalized difference is high (indicating that the pixel is moving),
replace the normalized difference from the previous resolution level at that pixel 
with the new value. Otherwise, retain the value from the previous resolution 
level. 

3. Repeat step 2 for all resolution levels. 

4. Finally, apply thresholding to the normalized motion-detection function at the 
highest resolution level. 
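Step 2 of the procedure above can be sketched at a single resolution level as follows. The sketch assumes the normalized difference is the gradient-weighted sum of absolute frame differences divided by the summed squared gradient magnitude plus a constant; it uses central differences with clamped borders and a square neighborhood, and the function names and default values are hypothetical. The pyramid construction and the coarse-to-fine propagation are omitted.

```python
def gradient_magnitude(image, r, c):
    """Central-difference gradient magnitude with indices clamped at borders."""
    height, width = len(image), len(image[0])
    gx = (image[r][min(c + 1, width - 1)] - image[r][max(c - 1, 0)]) / 2.0
    gy = (image[min(r + 1, height - 1)][c] - image[max(r - 1, 0)][c]) / 2.0
    return (gx * gx + gy * gy) ** 0.5


def normalized_fd(frame, model, r, c, radius=1, const=1.0):
    """Normalized frame difference at pixel (r, c) over a square neighborhood."""
    height, width = len(frame), len(frame[0])
    num = den = 0.0
    for i in range(max(r - radius, 0), min(r + radius, height - 1) + 1):
        for j in range(max(c - radius, 0), min(c + radius, width - 1) + 1):
            g = gradient_magnitude(model, i, j)
            num += abs(frame[i][j] - model[i][j]) * g
            den += g * g
    return num / (den + const)
```

For a ramp image shifted by one pixel, the value is close to 1, which suggests reading the measure as a rough displacement magnitude; in a perfectly flat region it is 0 regardless of the intensity change, which is the aperture problem noted above.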


Temporal memory can be incorporated into the decision process by considering 
accumulative differences over a sequence of N frames. An accumulative difference 
value for each pixel is incremented by one if the difference between the current frame 
and the model frame at that pixel location is bigger than a threshold. Thus, pixels 
with higher counter values are more likely to correspond to changed regions. 


Adaptive Background Modeling by Mixture of Gaussians 


A highly popular approach to modeling more complex backgrounds has been to 
consider each pixel as a temporal-pixel process, such that the distribution of the 
pixel intensity is modeled by a mixture of K Gaussians, where K varies between 
3 and 5 [Sta 99, Kae 02]. The model is based on the assumption that each observable
pixel value is generated by one of K different hidden states, representing different 
background or foreground objects/surfaces visible at that pixel, where the pixel- 
intensity distribution given the state k (due to temporal texture or noise and small 
illumination changes) is a Gaussian with mean $\mu_k$ and variance $\sigma_k^2$. The parameters
$\omega_k$, $k = 1, \ldots, K$, indicate the a priori probability that the pixel is generated by state
k, where $\sum_{k=1}^{K} \omega_k = 1$.

The mixture model enables learning repetitive pixel variations by maintaining 
K model distributions for each pixel; hence, a background model is maintained 
even if it is temporarily replaced by another distribution. Note that a single Gauss- 
ian for each pixel (i.e., K= 1) would be sufficient to model the observation noise, 
under the unrealistic assumption that each pixel resulted from a single surface 
under particular lighting. 

The number of states K, the mean $\mu_k$ and the variance $\sigma_k^2$ of each Gaussian,
as well as the a priori probability $\omega_k$ of each Gaussian, are all unknowns and must
be estimated from the observed pixel data. Given K, the maximum likelihood 
solution to this problem can be obtained by the expectation-maximization (EM) 
algorithm [Dem 77]. The EM algorithm works by iterating between two steps: 
E-step: find the expected value of the hidden state using the observed data and 
current estimates of the parameters; and M-step: calculate the maximum likeli- 
hood estimates of the parameters using the observed data and current estimate of 
the hidden state. 

Stauffer and Grimson [Sta 99] provide an online approximation to the EM solution
that can deal with lighting changes, repetitive variations, tracking through cluttered
regions, and introduction or removal of objects from the scene. The method has two
input parameters: the learning coefficient $\alpha$ and the proportion T of the data that
should belong to the background. Let the value of a pixel at frame t be denoted by
$s_t(x)$. At each frame t and at each pixel x, we determine the state k that generates pixel
x such that $s_t(x)$ is generated by state k if it is within $2.5\sigma_k$ of the mean $\mu_k$. Then, the
mean and variance of the kth Gaussian are updated as


$$\mu_{k,t} = (1-\rho)\, \mu_{k,t-1} + \rho\, s_t(x)$$
$$\sigma_{k,t}^2 = (1-\rho)\, \sigma_{k,t-1}^2 + \rho\, (s_t(x) - \mu_{k,t})^2$$
$$\rho = \alpha\, N(s_t(x) \mid \mu_{k,t-1}, \sigma_{k,t-1}^2)$$


$$\omega_{k,t} = (1-\alpha)\, \omega_{k,t-1} + \alpha\, M_{k,t}$$

where $N(s_t(x) \mid \mu_{k,t-1}, \sigma_{k,t-1}^2)$ denotes a Gaussian with mean $\mu_{k,t-1}$ and variance $\sigma_{k,t-1}^2$,
and $M_{k,t} = 1$ for the matching Gaussian and zero for all others. If the value of $s_t(x)$
does not match any of the existing Gaussians, the least probable distribution is
replaced by a new one with the mean $s_t(x)$, a high variance, and a low probability $\omega_{k,t}$.
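The per-pixel update can be sketched as follows for a single gray-level pixel. This is a minimal illustration under stated assumptions, not the reference implementation of [Sta 99]: the first Gaussian within 2.5 standard deviations is taken as the match, the least probable distribution is taken to be the one with the lowest weight, the weights are renormalized after each update, and the initial variance and weight values as well as all names are hypothetical.

```python
import math


def gaussian_pdf(x, mu, var):
    """Value of a 1D Gaussian density with mean mu and variance var at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def update_pixel_mixture(mix, s, alpha=0.05, init_var=900.0, init_weight=0.05):
    """Update one pixel's mixture in place; return the matched index or None.

    mix is a list of dicts with keys "w" (weight), "mu" (mean), "var" (variance).
    """
    match = None
    for k, g in enumerate(mix):
        if abs(s - g["mu"]) <= 2.5 * math.sqrt(g["var"]):
            match = k
            break
    # Weight update: M = 1 for the matched Gaussian, 0 for all others.
    for k, g in enumerate(mix):
        m = 1.0 if k == match else 0.0
        g["w"] = (1.0 - alpha) * g["w"] + alpha * m
    if match is None:
        # No match: replace the lowest-weight Gaussian by a wide new one.
        worst = min(range(len(mix)), key=lambda k: mix[k]["w"])
        mix[worst] = {"w": init_weight, "mu": float(s), "var": init_var}
    else:
        g = mix[match]
        rho = alpha * gaussian_pdf(s, g["mu"], g["var"])
        g["mu"] = (1.0 - rho) * g["mu"] + rho * s
        g["var"] = (1.0 - rho) * g["var"] + rho * (s - g["mu"]) ** 2
    total = sum(g["w"] for g in mix)
    for g in mix:
        g["w"] /= total
    return match
```

Each call costs a fixed amount of work per pixel, which is what makes the method usable online, in contrast to rerunning EM over a window of past frames.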

The Gaussians with the most supporting evidence (highest a priori probability)
and least variance belong to the background model. Hence, the Gaussians are
ordered from the highest to lowest value of $\omega_{k,t}/\sigma_{k,t}$, and the first B are chosen as the
background model, where


$$B = \arg\min_{b} \left( \sum_{k=1}^{b} \omega_{k,t} > T \right)$$
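The selection of the background Gaussians can be sketched as follows: sort by weight divided by standard deviation, then keep the shortest prefix whose cumulative weight exceeds T. The dictionary layout, the function name, and the default T are hypothetical.

```python
def background_components(mix, proportion=0.7):
    """Return indices of the Gaussians that form the background model.

    mix is a list of dicts with keys "w" (weight) and "var" (variance).
    """
    order = sorted(range(len(mix)),
                   key=lambda k: mix[k]["w"] / mix[k]["var"] ** 0.5,
                   reverse=True)
    chosen, cumulative = [], 0.0
    for k in order:
        chosen.append(k)
        cumulative += mix[k]["w"]
        if cumulative > proportion:
            break
    return chosen
```

A small T keeps only the single most stable Gaussian (a unimodal background), while a larger T admits several, which allows for repetitive backgrounds such as a flickering display or swaying foliage.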

Background modeling by adaptive mixture of Gaussians offers several advantages 
including: i) a different decision threshold applies to each pixel, ii) thresholds vary by 
time, and iii) multiple background models can co-exist. Perhaps the most important 
drawback of the method is that it requires an initialization phase, and learning can 
be slow especially in the presence of slow-moving objects. Improvements for faster 
learning and shadow modeling have been proposed in [Kae 02].

The case of a moving camera can be handled similarly, once the global camera 
motion between successive frames is estimated and compensated for [Mec 98]. 


Spatial and Temporal Consistency 


Another consideration is to enforce consistency of the boundaries of the changed 
regions with spatial-edge locations at each frame. This may be accomplished by first 
segmenting each frame into uniform color and/or texture regions. Next, each region 
resulting from the spatial segmentation is labeled as changed or unchanged as a 
whole as opposed to labeling each pixel independently. Region-labeling decisions 
may be based on the number of changed and unchanged pixels within each region or 
thresholding the average value of the frame differences within each region. 

The boundary of changed regions is smoothed by a relaxation method using local 
adaptive thresholds [Aac 93]. Memory is incorporated by re-labeling unchanged pix- 
els that correspond to changed locations in one of the last L frames to ensure tem- 
poral continuity of changed regions across frames. The depth of the memory L may 
be adapted to scene content to limit error propagation. Finally, post-processing to 
obtain the final changed and unchanged masks eliminates small regions. 


ViBe Algorithm [Bar 11] 


The visual background extractor (ViBe) is a universal motion-detection or background
extraction method that incorporates a pixel-based temporal model and a
policy for spatial propagation of background pixel values. In the following, we present
the background model as well as the model initialization and model update policies
of the ViBe algorithm.


Background Model 


Let s(x) denote the value of pixel x and $s_i$ denote a background sample value with
index i. Each background pixel x is modeled by a collection of N background samples
$M(x) = \{s_1, s_2, \ldots, s_N\}$ taken from previous frames. To classify a pixel value s(x) according
to its corresponding model M(x), we compare it to the closest values within the
set of samples by defining a sphere of radius R centered on s(x). The pixel value s(x) 
is then classified as background if the cardinality of the set intersection of this sphere 
and the collection of model samples M(x) is larger than or equal to a given thresh- 
old. The classification of a pixel value s(x) involves the computation of N distances 
between s(x) and model samples, and of N comparisons with a thresholded Euclid- 
ean distance R. The ViBe model is determined by two parameters only: the radius 
R of the sphere and the minimal cardinality. A radius R= 20 (for monochromatic 
images) and a cardinality of 2 have been suggested as universal parameters, with no 
need to adapt them from frame-to-frame or from pixel-to-pixel. The classification 
step of ViBe compares the current pixel value $s_t(x)$ to the samples in the background
model of the previous frame, $M_{t-1}(x)$.
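The classification rule can be sketched for a monochromatic pixel as follows, using the radius of 20 and the cardinality of 2 quoted above. The function and parameter names are hypothetical, and for gray levels the Euclidean distance reduces to an absolute difference.

```python
def vibe_is_background(value, samples, radius=20, min_matches=2):
    """Classify a gray-level pixel value against its stored background samples."""
    matches = 0
    for sample in samples:
        if abs(value - sample) < radius:
            matches += 1
            if matches >= min_matches:
                return True  # enough close samples; stop comparing early
    return False
```

Stopping as soon as the required number of close samples is found means that most background pixels need far fewer than N comparisons.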


Background-Model Initialization 


Many techniques in the literature need several dozen frames to initialize their mod- 
els. In order to respond to sudden illumination or scene changes, it is desirable to 
be able to initialize the background model from a single frame, so the existing back- 
ground model can be discarded and a new model initialized instantaneously. To this 
effect, ViBe assumes that neighboring pixels share a similar temporal distribution and 
populates M(x) with values in the spatial neighborhood of each pixel. The size of the 
neighborhood needs to be large enough to have a sufficient number of different sam- 
ples, while the statistical correlation between pixel values decreases as the size of the 
neighborhood increases. The only drawback is that the presence of a moving object 
in the first frame will introduce an artifact called a ghost (i.e., a set of connected 
points, detected as in motion, but not corresponding to any real moving object). In 
this particular case, the ghost is caused by the undesired initialization of pixel models 
with samples coming from the moving object. We note that the ghost fades over time 
through the regular model update process, which learns the real background. 
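Single-frame initialization can be sketched as follows: each pixel model is filled with values drawn at random from the pixel's spatial neighborhood in the first frame. The 8-connected neighborhood, the default of 20 samples per pixel, and the function name are assumptions made for illustration.

```python
import random


def init_vibe_model(frame, n_samples=20, rng=None):
    """Fill each pixel model with values drawn from its 8-connected neighbors.

    frame is a list of rows of gray levels and must be at least 2 x 2.
    """
    rng = rng or random.Random()
    height, width = len(frame), len(frame[0])
    model = [[None] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            neighbors = [frame[i][j]
                         for i in range(max(r - 1, 0), min(r + 2, height))
                         for j in range(max(c - 1, 0), min(c + 2, width))
                         if (i, j) != (r, c)]
            model[r][c] = [rng.choice(neighbors) for _ in range(n_samples)]
    return model
```

If a moving object is present in the first frame, its values are sampled into the models it covers, which is the source of the ghost described above.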


Background-Model Update


The model update step decides which samples will be memorized by the model and 
for how long. The classical model update policy is to discard and replace the oldest 
value after a number of frames. The ViBe update method incorporates three novel 
policies so the model can adapt to changing conditions: 


1. ViBe chooses the sample to be discarded randomly according to a uniform pdf, 
instead of always removing the oldest sample from the pixel model M(x). 

2. Random time sub-sampling: The random replacement policy allows the 
pixel model to cover a large (theoretically infinite) time window with a lim- 
ited number of samples, but in the presence of periodic or pseudo-periodic 
background motions, the use of fixed sub-sampling intervals might prevent 
the background model from properly adapting to these motions. So when a 
pixel value has been classified as belonging to the background, a random pro- 
cess determines whether this value is used to update the corresponding pixel 
model. 

3. A mechanism that propagates background pixel samples spatially to ensure spa- 
tial consistency and to allow the adaptation of the background pixel models that 
are masked by the foreground: ViBe considers that neighboring background 
pixels share a similar temporal distribution and that a new background sample 
of a pixel should also update the models of neighboring pixels. According to 
this policy, background models hidden by the foreground will be updated with 
background samples from neighboring pixel locations from time to time. This 
allows a spatial diffusion of information regarding the background evolution 
that relies on samples classified exclusively as background. ViBe's background
model is thus able to adapt to a changing illumination and to structural evolu- 
tions (added or removed background objects), while relying on a strict conser- 
vative update scheme. 
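The three update policies can be sketched together as follows. This is an illustration under stated assumptions: the sub-sampling factor of 16, the 8-connected neighbor choice with clamping at the borders, and all names are hypothetical, and the model layout is a list of sample lists per pixel.

```python
import random

NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                    (0, 1), (1, -1), (1, 0), (1, 1)]


def vibe_update(model, frame, is_background, subsample=16, rng=None):
    """Conservative, randomized in-place update of per-pixel sample sets."""
    rng = rng or random.Random()
    height, width = len(frame), len(frame[0])
    for r in range(height):
        for c in range(width):
            if not is_background[r][c]:
                continue  # foreground values never enter any model
            # Policies 1 and 2: update with probability 1/subsample and
            # overwrite a uniformly chosen sample, not the oldest one.
            if rng.randrange(subsample) == 0:
                samples = model[r][c]
                samples[rng.randrange(len(samples))] = frame[r][c]
            # Policy 3: occasionally propagate the value to a neighbor model.
            if rng.randrange(subsample) == 0:
                dr, dc = rng.choice(NEIGHBOR_OFFSETS)
                nr = min(max(r + dr, 0), height - 1)
                nc = min(max(c + dc, 0), width - 1)
                neighbor = model[nr][nc]
                neighbor[rng.randrange(len(neighbor))] = frame[r][c]
```

Because propagation writes into neighboring models, a pixel hidden behind a static foreground object still receives background samples over time, without any foreground value ever being inserted.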


Other Approaches 


The dominant-motion segmentation approach, as discussed in Section 5.3.1, can 
also be used for foreground/background separation, assuming that dominant motion 
originates either from the background or a foreground object in the scene. Irani and 
Anandan [Ira 98] propose using planar parallax to detect moving objects in 2D/3D 
scenes, e.g., when a scene is approximately flat or when the camera undergoes only 
rotation and zoom. 


5.3 Motion Segmentation


Motion-segmentation (also known as optical-flow segmentation) methods label pix- 
els (or optical-flow vectors) at each frame that are associated with independently 
moving parts of a scene. The region boundaries may or may not be pixel-accurate or 
semantically meaningful. For example, a single object with articulated motion may 
be segmented into multiple regions. Occlusion and aperture problems are mainly 
responsible for misalignment of motion and actual object boundaries. Furthermore, 
model misfit possibly due to deviation of the surface structure from a plane generally 
leads to over-segmentation of the motion field. While it is possible to achieve fully 
automatic motion segmentation with limited accuracy for certain content domains,
semantically meaningful object segmentation generally requires user interaction to 
define the object of interest in at least some key frames as discussed in Section 5.5. 

Motion segmentation is closely related to two other problems, motion (change) 
detection and motion estimation. Change detection, discussed in Section 5.2, is a 
special case of motion segmentation with only two regions, changed and unchanged 
regions (in the case of a static camera) or global and local motion regions (in the case 
of a moving camera). Change detection in the case of a moving camera and general 
motion segmentation require some sort of global and/or local motion estimation,
either explicitly or implicitly. Motion detection and segmentation are also plagued 
with the same two fundamental limitations associated with motion estimation: 
occlusion and aperture problems (see Chapter 4). For example, pixels in a flat-image 
region may appear stationary even if they are moving due to the aperture problem 
(hence, the need for hierarchical methods), and/or erroneous labels may be assigned 
to pixels in covered or uncovered image regions due to the occlusion problem. 

In general, application of standard image segmentation methods directly to esti- 
mated optical-flow vectors may not yield meaningful results, since an object moving 
in 3D usually generates a spatially varying optical-flow field [Adi 85]. For example, 
in a rotating object, there is no flow at the center of the rotation, and the magni- 
tude of flow vectors grows as we move away from the center of rotation. Therefore, 
a parametric model-based approach, where we assume that the motion field can 
be described by a set of K parametric models, is usually adopted. In parametric- 
motion segmentation, the model parameters are the motion features. Then, motion- 
segmentation algorithms aim to determine the number of motion models that can 
adequately describe a scene, type/complexity of these motion models, and the spa- 
tial support of each motion model. The most commonly used types of parametric 
models are affine, perspective, and quadratic mappings, which assume a 3D planar 
surface in motion. In the case of a non-planar object, the resulting optical flow can
be modeled by a piecewise affine, perspective, or quadratic-flow field if we approxi- 
mate the object surface by a union of a small number of planar patches. Since each 
independently moving object and/or planar patch will best fit a different parametric 
model, the parametric approach may lead to over-segmentation of motion in the 
case of non-planar objects. 


5.3.1 Dominant-Motion Segmentation 


Segmentation by dominant motion refers to extracting one object (with the dominant 
motion) from the scene at a time [Bur 91, Ber 91, Wu 93, Hsu 94, Ira 94, Aye 95]. 
Dominant-motion segmentation can be considered as a hierarchically structured top- 
down approach that starts by fitting a single parametric-motion model to the entire 
frame and then partitions the frame into two regions, those pixels that are well repre- 
sented by this dominant-motion model and those that are not. The process converges to 
the dominant-motion model in a few iterations, each time fitting a new model to only 
those pixels that are well represented by the motion model in the previous iteration. The 
dominant motion may correspond to the camera (background) motion or a foreground 
object motion, whichever occupies a larger area in the frame. The dominant-motion 
approach may also handle separation of individually moving objects. Once the first 
dominant object is segmented, it is excluded from the region of analysis, and the entire 
process is repeated to define the next dominant object. This is unlike the multiple- 
motion segmentation approaches discussed in the next section, which start with an 
initial segmentation mask (usually with many small regions) and refine them according 
to some criterion function to form the final mask. It is worth noting that the dominant- 
motion approach is a direct method that is based on spatio-temporal-image intensity 
gradient information. This is in contrast to first estimating the optical-flow field between 
two frames and then segmenting the image based on the estimated optical-flow field. 


Segmentation Using Two Frames 


Motion estimation in the presence of more than one moving object with unknown 
supports is a difficult problem. Burt et al. [Bur 91] showed that the motion of 2D 
translating objects can be accurately estimated by using a multi-resolution itera- 
tive approach, even in the presence of other independently moving objects without 
prior knowledge of their supports. This is, however, not always possible with more 
sophisticated motion models (e.g., affine and perspective), which are more sensitive 
to presence of other moving objects in the region of analysis. 

Irani et al. [Ira 94] proposed multi-stage parametric modeling of the dominant
motion. In this approach, first a translational-motion model is employed over the 
whole image to obtain a rough estimate of the support of the dominant motion. The 
complexity of the model is then gradually increased to affine and projective models 
with refinement of the support of the object in between. The parameters of each 
model are estimated only over the support of the object, based on the previously used 
model. The procedure can be summarized as follows: 


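1. Compute the dominant 2D translation vector $d = (d_1, d_2)$ over the whole frame
by solving

$$\begin{bmatrix} \sum I_1^2 & \sum I_1 I_2 \\ \sum I_1 I_2 & \sum I_2^2 \end{bmatrix} \begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} -\sum I_1 I_t \\ -\sum I_2 I_t \end{bmatrix} \quad (5.30)$$

where $I_1$, $I_2$, and $I_t$ denote partials of image intensity with respect to $x_1$, $x_2$,
and t. In case the dominant motion is not a translation, the estimated translation
becomes a first-order approximation of the dominant motion.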
2. Label all pixels that correspond to the estimated dominant motion as follows: 


a. Register the two images using the estimated dominant-motion model. The 
dominant object appears stationary between the registered images, while 
other parts of the image do not. 

b. Detect and label stationary regions between the registered images, which 
can be done by the multi-resolution change-detection algorithm discussed 
in Section 5.2. 

c. In addition to the normalized frame difference (5.28), we define a motion-reliability
measure as the reciprocal of the condition number of the coefficient
matrix in (5.30), given by [Ira 94]


$$\frac{\lambda_{\min}}{\lambda_{\max}}$$


where $\lambda_{\min}$ and $\lambda_{\max}$ are the smallest and largest eigenvalues of the coefficient
matrix. A pixel is classified as stationary at a resolution level if its normalized 
frame difference is low and its motion reliability is high. This step defines 
the new region of analysis for the next step. 
3. Estimate the parameters of a higher-order motion model (affine, perspective, or 
quadratic) over the new region of analysis as in [Ira 94]. 
4. Iterate over steps 2 and 3 until a satisfactory segmentation is attained. 
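Step 1 and the motion-reliability measure of step 2c can be sketched as follows. The sketch assumes the usual least-squares normal equations of the optical-flow constraint (a 2 × 2 system built from sums of products of the spatial and temporal intensity derivatives), central differences for the spatial derivatives, a simple frame difference for the temporal derivative, and a single resolution level; the function name is hypothetical.

```python
import math


def dominant_translation(frame0, frame1):
    """Estimate one translation (d1 along columns, d2 along rows) for the frame.

    Returns the translation and the reliability lambda_min / lambda_max of the
    2 x 2 coefficient matrix. Frames are lists of rows of gray levels.
    """
    height, width = len(frame0), len(frame0[0])
    a11 = a12 = a22 = b1 = b2 = 0.0
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            i1 = (frame0[r][c + 1] - frame0[r][c - 1]) / 2.0  # d/dx1
            i2 = (frame0[r + 1][c] - frame0[r - 1][c]) / 2.0  # d/dx2
            it = frame1[r][c] - frame0[r][c]                  # d/dt
            a11 += i1 * i1
            a12 += i1 * i2
            a22 += i2 * i2
            b1 -= i1 * it
            b2 -= i2 * it
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("coefficient matrix is singular (aperture problem)")
    d = ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)
    # Eigenvalues of the symmetric matrix [[a11, a12], [a12, a22]].
    half_trace = (a11 + a22) / 2.0
    disc = math.sqrt(max(half_trace * half_trace - det, 0.0))
    return d, (half_trace - disc) / (half_trace + disc)
```

A reliability near 0 indicates that the gradients in the region point mostly in one direction, so only one component of the motion is well constrained.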


Temporal Consistency


Temporal consistency of estimated dominant object regions can be facilitated by 
defining an internal representation image [Ira 94]: 


$$\bar{s}(x,k) = \begin{cases} (1-\alpha)\, s(x,k) + \alpha\, \mathrm{warp}(\bar{s}(x,k-1), s(x,k)) & k \geq 1 \\ s(x,k) & k = 0 \end{cases} \quad (5.32)$$


where warp(A, B) denotes warping image A toward image B according to the
dominant-motion parameters estimated between images A and B, and $0 < \alpha < 1$.
As in the case of background subtraction, the unchanged regions in $\bar{s}(x,k)$ maintain
their sharpness with a reduced level of noise, while the changed regions are blurred 
after processing a few frames. 
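The update of the internal representation image can be sketched as follows. To keep the example short, the dominant motion is assumed to be an integer translation, so warping reduces to a shift with clamped borders; a real implementation would warp with the estimated affine or projective parameters and interpolate. All names and the default alpha are hypothetical.

```python
def warp_translate(image, shift):
    """Shift an image by integer (rows, cols), replicating border pixels."""
    dr, dc = shift
    height, width = len(image), len(image[0])
    return [[image[min(max(r - dr, 0), height - 1)]
                  [min(max(c - dc, 0), width - 1)]
             for c in range(width)]
            for r in range(height)]


def update_internal(internal_prev, frame, shift, alpha=0.7):
    """Blend the new frame with the warped previous internal representation."""
    warped = warp_translate(internal_prev, shift)
    return [[(1.0 - alpha) * s + alpha * v
             for s, v in zip(frame_row, warped_row)]
            for frame_row, warped_row in zip(frame, warped)]
```

Pixels that follow the dominant motion line up after warping and are averaged over time, while pixels that move differently are blended with mismatched values and blur.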

The algorithm to track the dominant object across multiple frames can be sum- 
marized as follows [Ira 94]. For each frame: 


1. Compute the dominant-motion parameters between the internal representation
image $\bar{s}(x,k-1)$ and the new frame s(x,k) within the support $M_{k-1}$ of the
dominant object at the previous frame.

2. Warp the internal representation image at frame k—1 toward the new frame 
according to the computed motion parameters. 

3. Detect the stationary regions between the registered images as described in Section
5.2.2 using $M_{k-1}$ as an initial estimate to compute the new mask $M_k$.

4. Update the internal representation image using Eqn. (5.32). 


Comparing each new frame with the internal representation image as opposed to 
the previous frame allows the method to track the same object. This is because the 
noise is significantly filtered in the internal representation image of the tracked object, 
and the image gradients outside the tracked object are lowered due to blurring. Note 
that there is no temporal motion-constancy assumption in this tracking scheme. 


Multiple Motions 


Multiple-object segmentation can be achieved by repeating the same segmentation 
procedure on the residual image after each dominant object is extracted. Once the 
first dominant object is segmented and tracked, the procedure can be repeated recur- 
sively to segment and track the next dominant object after excluding all pixels belong- 
ing to the first object from the region of analysis. Hence, the method is capable of 
segmenting multiple moving objects in a top-down fashion if a dominant motion 
exists at each stage. 

Some difficulties with the dominant-motion approach have been reported when
there is no overwhelmingly dominant motion. In the absence of competing motion
models, the dominant-motion approach can lead to arbitrary decisions (relying upon
absolute threshold values), especially when the motion measure indicates unreliable
motion vectors (in low spatial-gradient regions). Sawhney et al. [Saw 95] proposed
robust estimators to partially alleviate this problem.


5.3.2 Multiple-Motion Segmentation 


Multiple-motion segmentation methods allow multiple-motion models to compete 
against each other at each decision site. They consist of three basic steps, which are 
strongly interrelated: estimation of the number K of independent motions, estima- 
tion of model parameters for each motion, and determination of support of each 
model (segmentation labels). If we assume we know the number K of motions and
the K sets of motion parameters, then we can determine the support of each model. 
The segmentation procedure then assigns the label of the parametric-motion vector 
that is closest to the estimated flow vector at each site. Alternatively, if we assume we 
know the value of K and a segmentation map consisting of K regions, the parameters 
for each model can be computed in the least-squares sense (either from estimated 
flow vectors or from spatio-temporal intensity values) over the support of the respective
region. But since both the parameters and supports are unknown in reality, we
have a chicken-and-egg problem; i.e., we need to know the motion-model parameters
to find the segmentation labels, and the segmentation labels are needed to find the 
motion-model parameters. 

Various approaches exist in the literature for solving this problem by iterative 
procedures. They may be grouped as: segmentation by clustering in the motion- 
parameter space [Adi 85, Wan 94, Kru 96], maximum-likelihood (ML) segmen- 
tation [Aye 95, Wei 96, Alt 98], and maximum a posteriori probability (MAP) 
segmentation [Mur 87], which are discussed next. 


Clustering in the Motion-Parameter Space 


A simple segmentation strategy is to first determine the number K of models (motion 
hypotheses) that are likely to be observed in a sequence and then perform cluster- 
ing in the model parameter space (e.g., a six-dimensional space for the case of affine 
models) to find K models representing the motion. In the following, we study two 
distinct approaches in this class: the K-means method and the Hough transform 
method. 




K-Means Method 


Wang and Adelson (W-A) [Wan 94] employed K-means clustering for segmenta- 
tion in their layered video representation. The W-A method starts by partitioning 
the image into non-overlapping blocks uniformly distributed over the image, and 
fits an affine model to the estimated motion field (optical flow) within each block. 
In order to determine the reliability of the parameter estimates at each block, the 
sum of squared distances between the synthesized and estimated flow vectors is 
computed as 


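$$
\varepsilon^2 = \sum_{\mathbf{x} \in B} \left\| \mathbf{v}(\mathbf{x}) - \tilde{\mathbf{v}}(\mathbf{x}) \right\|^2 \tag{5.33}
$$


where $\mathbf{v}$ and $\tilde{\mathbf{v}}$ denote the estimated and synthesized flow vectors, respectively, and
B refers to a block of pixels. If the flow within the block complies with a single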
affine model, the residual will be small. On the other hand, if the block falls on 
the boundary between two distinct motions, the residual will be large. The motion 
parameters for blocks with acceptably small residuals are selected as seed models. 
Then, the seed model parameter vectors are clustered to find the K representative 
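affine-motion models. The clustering procedure can be described as: Given N seed
affine parameter vectors $A_1, A_2, \ldots, A_N$, where

$$
A_n = \begin{bmatrix} a_{n,1} & a_{n,2} & \cdots & a_{n,6} \end{bmatrix}^T \tag{5.34}
$$

find K cluster centers $\bar{A}_1, \bar{A}_2, \ldots, \bar{A}_K$, where $K < N$, and the label $k$, $k = 1, \ldots, K$,
assigned to each affine parameter vector $A_n$, which minimizes

$$
\sum_{n=1}^{N} D(A_n, \bar{A}_k)
$$

The distance measure D between two affine parameter vectors $A_n$ and $\bar{A}_k$ is given by

$$
D(A_n, \bar{A}_k) = A_n^T M \bar{A}_k \tag{5.35}
$$

where M is a $6 \times 6$ scaling matrix.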

The solution to this problem is given by the well-known K-means algorithm,
which consists of the following iteration:


1. Initialize $\bar{A}_1, \bar{A}_2, \ldots, \bar{A}_K$ arbitrarily.
2. For each seed block $n$, $n = 1, 2, \ldots, N$, find $k$ given by

$$
k = \arg\min_{s} D(A_n, \bar{A}_s)
$$

where s takes values from the set {1, 2, ..., K}. It should be noted that if the
minimum distance exceeds a threshold, then the site is not labeled, and the
corresponding flow vector is ignored in the parameter update that follows.

3. Define $S_k$ as the set of seed blocks whose affine parameter vector is closest to
$\bar{A}_k$, $k = 1, \ldots, K$. Then, update the class means

$$
\bar{A}_k = \frac{\sum_{n \in S_k} A_n}{\sum_{n \in S_k} 1}
$$

4. Repeat steps 2 and 3 until the class means $\bar{A}_k$ do not change by more than a
pre-defined amount between successive iterations.


Statistical tests can be applied to eliminate parameter vectors that are considered 
as outliers. 
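
The K-means iteration over seed affine parameter vectors can be sketched as follows. This is an illustrative sketch under stated assumptions, not the W-A implementation: the function names are hypothetical, the scaling matrix M is taken to be diagonal (passed as `weights`), and the distance is computed as the weighted squared difference between the two parameter vectors so that identical vectors have zero distance. Labels are 0-based in the code, and seeds farther than `max_dist` from every center are left unlabeled (`None`) and excluded from the mean update.

```python
def weighted_sq_distance(a, b, weights):
    """Weighted squared difference of two parameter vectors (diagonal M assumed)."""
    return sum(w * (ai - bi) ** 2 for ai, bi, w in zip(a, b, weights))


def kmeans_affine(seeds, centers, weights, max_dist=float("inf"), tol=1e-9, max_iter=100):
    centers = [list(c) for c in centers]
    labels = [None] * len(seeds)
    for _ in range(max_iter):
        # Step 2: label each seed with its nearest center, or leave it unlabeled.
        labels = []
        for a in seeds:
            dists = [weighted_sq_distance(a, c, weights) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            labels.append(k if dists[k] <= max_dist else None)
        # Step 3: replace each center by the mean of its members.
        new_centers = []
        for k, c in enumerate(centers):
            members = [a for a, lab in zip(seeds, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(c)  # keep an empty cluster's center unchanged
        # Step 4: stop once the class means no longer move.
        shift = max(weighted_sq_distance(c, n, weights)
                    for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels


# Four seed blocks: two moving right by about 1 pixel, two moving left by about 2.
seeds = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, -2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, -2.2, 0.0, 0.0, 0.0],
]
centers, labels = kmeans_affine(seeds, [seeds[0], seeds[2]], weights=[1.0] * 6)
print(labels, centers[0][2], centers[1][2])
```

A non-uniform `weights` vector would play the role of M in balancing the linear terms against the translational terms, which live on very different numeric scales.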

Once K cluster centers are determined, a label-assignment procedure is employed 
to assign a segmentation label z(x) to each pixel x as 


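$$
z(\mathbf{x}) = \arg\min_{k} \left\| \mathbf{v}(\mathbf{x}) - P(\bar{A}_k; \mathbf{x}) \right\|^2 \tag{5.36}
$$


where $k$ is from the set {1, 2, ..., K}, the operator P is defined as

$$
P(A_k; \mathbf{x}) = \begin{bmatrix} a_{k,1} x_1 + a_{k,2} x_2 + a_{k,3} \\ a_{k,4} x_1 + a_{k,5} x_2 + a_{k,6} \end{bmatrix} \tag{5.37}
$$

and v(x) is the dense-motion vector at pixel x given by

$$
\mathbf{v}(\mathbf{x}) = \begin{bmatrix} v_1(\mathbf{x}) \\ v_2(\mathbf{x}) \end{bmatrix} \tag{5.38}
$$

where $v_1$ and $v_2$ denote the horizontal and vertical components, respectively. All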
sites without labels are assigned one according to the motion-compensation crite- 
rion, which assigns the label of the parameter vector that gives the best motion 
compensation at that site. This feature ensures more robust parameter estimation by 
eliminating the outlier vectors. Several post-processing operations may be employed
to improve the accuracy of the segmentation map. 
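
The label-assignment rule of Eqn. (5.36) with the affine operator of Eqn. (5.37) can be illustrated with a short sketch. The data layout (a dictionary from pixel coordinates to flow vectors) and the function names are assumptions made for the example, and labels are 0-based rather than running from 1 to K.

```python
def affine_flow(a, x1, x2):
    """Parametric flow P(A; x) for a six-parameter affine model a = [a1..a6]."""
    return (a[0] * x1 + a[1] * x2 + a[2], a[3] * x1 + a[4] * x2 + a[5])


def assign_labels(flow, models):
    """flow: dict (x1, x2) -> (v1, v2). Returns dict pixel -> index of best model."""
    labels = {}
    for (x1, x2), (v1, v2) in flow.items():
        errors = []
        for a in models:
            p1, p2 = affine_flow(a, x1, x2)
            errors.append((v1 - p1) ** 2 + (v2 - p2) ** 2)
        labels[(x1, x2)] = min(range(len(models)), key=errors.__getitem__)
    return labels


models = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],  # translation by one pixel to the right
    [0.1, 0.0, 0.0, 0.0, 0.1, 0.0],  # zoom about the origin
]
flow = {
    (0, 0): (0.9, 0.1),
    (10, 10): (1.0, 0.9),
    (5, -5): (0.4, -0.5),
}
pixel_labels = assign_labels(flow, models)
print(pixel_labels)
```

Because each pixel is labeled independently, a single noisy flow vector can flip a label, which is one reason post-processing of the segmentation map is usually needed.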

The procedure can be repeated by estimating new seed model parameters over 
the regions estimated in the previous iteration. Furthermore, the number of clusters 
can be varied by splitting or merging of clusters between iterations. The K-means 
method requires a good initial estimate of the number of classes K. The Hough trans- 
form methods do not require this information but are more expensive. 


Hough Transform Methods 


The Hough transform is a well-known clustering technique where the data samples 
“vote” for the most representative feature values in a quantized feature space. In a 
straightforward application of the Hough transform to optical-flow segmentation 
using the six-parameter affine-flow model (5.37), the six-dimensional feature space 
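$a_1, \ldots, a_6$ is quantized to a number of sets (states) after the minimal and maximal
values for each parameter are determined. Then, each flow vector v(x) votes for a set
k of quantized parameters that minimizes

$$
\varepsilon^2(\mathbf{x}) = \varepsilon_1^2(\mathbf{x}) + \varepsilon_2^2(\mathbf{x})
$$

where

$$
\begin{aligned}
\varepsilon_1(\mathbf{x}) &= v_1(\mathbf{x}) - a_{k,1} x_1 - a_{k,2} x_2 - a_{k,3} \\
\varepsilon_2(\mathbf{x}) &= v_2(\mathbf{x}) - a_{k,4} x_1 - a_{k,5} x_2 - a_{k,6}
\end{aligned}
$$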


The parameter sets that receive at least a predetermined number of votes are
likely to represent candidate motions. The number of classes K and the correspond- 
ing parameter sets to be used in labeling individual flow vectors are hence deter- 
mined. The drawback of this scheme is the significant amount of computation and 
memory requirements involved. 
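
A toy version of the voting step may make the accumulation concrete. To keep the grid small, the sketch below votes only over the three parameters of the horizontal flow component, $(a_1, a_2, a_3)$, on a deliberately coarse grid; this restriction, the tie rule (every cell that attains the minimum error receives a vote), and all names are assumptions of the example rather than part of any published method.

```python
from collections import Counter
from itertools import product


def hough_votes(samples, grids, tol=1e-12):
    """samples: list of (x1, x2, v1); grids: quantized values for a1, a2, a3."""
    votes = Counter()
    cells = list(product(*grids))
    for x1, x2, v1 in samples:
        errors = {a: (v1 - a[0] * x1 - a[1] * x2 - a[2]) ** 2 for a in cells}
        best = min(errors.values())
        # A single flow sample cannot pin down three parameters, so all cells
        # that tie for the minimum error receive a vote.
        for cell, err in errors.items():
            if err <= best + tol:
                votes[cell] += 1
    return votes


def candidate_motions(votes, min_votes):
    """Cells with at least min_votes votes, in sorted order."""
    return sorted(cell for cell, n in votes.items() if n >= min_votes)


# Four samples from a pure translation (v1 = 1) and four from v1 = 0.5 * x1.
samples = [
    (0, 0, 1.0), (4, 0, 1.0), (0, 4, 1.0), (4, 4, 1.0),
    (2, 1, 1.0), (4, 2, 2.0), (6, 3, 3.0), (2, 5, 1.0),
]
grids = ([0.0, 0.5], [0.0, 0.5], [0.0, 1.0])
votes = hough_votes(samples, grids)
candidates = candidate_motions(votes, min_votes=4)
print(candidates)
```

Even in this reduced setting the accumulator has one entry per grid cell, so a full six-dimensional grid with fine quantization grows very quickly, which illustrates the computation and memory cost noted above.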

In order to keep the computational cost at a reasonable level, several modi- 
fied Hough methods have been presented. The proposed simplifications include 
[Adi 85]: 


1. Decomposition of the parameter space into two disjoint subsets $\{a_1, a_2, a_3\} \times
\{a_4, a_5, a_6\}$ to perform two 3D Hough transforms.

2. A multi-resolution Hough transform, where at each resolution level the parameter
space is quantized around the estimates obtained at the previous level.

3. A multi-pass Hough technique, where the flow vectors that are most consistent
with the candidate parameters are grouped first. In the second stage, those com- 
ponents formed in the first stage that are consistent with the same flow model in 
the least-squares sense are merged together to form segments. Several merging 
criteria have been proposed. In the third and final stage, ungrouped flow vectors 
are assimilated into one of their neighboring segments. 


Other simplifications that are proposed include the probabilistic Hough trans- 
form [Kir 91] and the randomized Hough transform [Kru 96]. 

Clustering in the parameter space has some drawbacks: i) both methods rely on 
pre-computed optical flow as input, which is generally blurred at motion boundaries 
and may contain outliers; ii) clustering based on distances in the parameter space can 
lead to clustered parameters that are not physically meaningful and the results are 
sensitive to the choice of the weight matrix M and small errors in the estimation of 
affine parameters; and iii) parameter-clustering and label-assignment procedures are 
decoupled; hence, ad-hoc post-processing operations that depend on some threshold 
values are needed to clean up the final segmentation map. The maximum-likelihood 
segmentation method, discussed next, addresses these shortcomings. 


Maximum-Likelihood Segmentation 


Motion-segmentation approaches in general are classified as optical-flow segmenta- 
tion methods, which operate on pre-computed optical-flow estimates, and direct 
methods, which operate on spatio-temporal intensity values. We present here a uni- 
fied formulation that covers both cases. The ML method finds the segmentation 
labels that maximize the likelihood function, which models the deviation of the 
observations (estimated dense-motion vectors or observed intensity values) from a 
parametric description of them (parametric-motion vectors or motion-compensated 
intensity values, respectively) for a given motion model. 


We start by defining the log-likelihood function as 
L(o|z) = log(p(o|z)) (5.39) 


where z denotes the lexicographical ordering of the segmentation labels z(x), which 
takes values from the set {1,2,...,K} at each pixel x. The vector o stands for the 
lexicographic ordering of the observations, which are either estimated dense-motion 
(optical flow) vectors or image intensity values. The conditional probability p(o|z) 
quantifies how well piecewise parametric-motion modeling fits the observations 
o given the segmentation labels z. If we model the mismatch between the observations
o(x) and their parametric representations computed by the operator $O(A_k; \mathbf{x})$,

$$
\varepsilon(\mathbf{x}) = o(\mathbf{x}) - O(A_k; \mathbf{x})
$$

where $A_k$ denotes the kth parametric-motion model, by white, Gaussian noise with
zero mean and variance $\sigma^2$, then the conditional pdf of the observations given the
segmentation labels can be expressed as

$$
p(\mathbf{o} \mid \mathbf{z}) = \frac{1}{(2\pi\sigma^2)^{M/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{M} \varepsilon^2(\mathbf{x}_i) \right\} \tag{5.40}
$$

where M is the number of observations available at sites $\mathbf{x}_i$. Assuming that the
parametric-flow model is more or less accurate, this deviation is due to the presence
of observation noise (given correct segmentation labels). Then, the problem is
finding the K motion models, $A_1, A_2, \ldots, A_K$, and a label field z(x) to maximize the
log-likelihood function L(o|z). We consider two cases.


Case 1 — Pre-Computed Optical-Flow Segmentation 


The observation o(x) stands for the estimated dense-motion vectors v(x), and opera- 
tor O stands for the parametric-motion operator P given by Eq. (5.37) or a higher- 
order model (see Section 4.2.3) given by 


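$$
\tilde{v}_1(\mathbf{x}) = a_1 x_1 + a_2 x_2 + a_3 + a_7 x_1^2 + a_8 x_1 x_2 \tag{5.41a}
$$

$$
\tilde{v}_2(\mathbf{x}) = a_4 x_1 + a_5 x_2 + a_6 + a_7 x_1 x_2 + a_8 x_2^2 \tag{5.41b}
$$

Then,

$$
\varepsilon^2(\mathbf{x}_i) = \left[ v_1(\mathbf{x}_i) - \tilde{v}_1(\mathbf{x}_i) \right]^2 + \left[ v_2(\mathbf{x}_i) - \tilde{v}_2(\mathbf{x}_i) \right]^2 \tag{5.42}
$$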


is the norm-squared deviation of the actual flow vectors from what is predicted by 
the quadratic-flow model. This case concerns motion segmentation by motion- 
vector matching. 


Case 2 — Direct Segmentation 


The observation o(x) stands for the scalar-pixel intensities $I_k(\mathbf{x})$ at frame k, and the
operator O is the motion-compensation operator Q, defined by

$$
Q(A_{z(\mathbf{x})}; \mathbf{x}) = I_{k+1}(\mathbf{x}') \tag{5.43}
$$

where

$$
\mathbf{x}' = \begin{bmatrix} x_1' \\ x_2' \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + P(A_{z(\mathbf{x})}; \mathbf{x})
$$

Then,

$$
\varepsilon^2(\mathbf{x}) = \left( I_k(\mathbf{x}) - I_{k+1}(\mathbf{x}') \right)^2 \tag{5.44}
$$


This case is segmentation by motion-compensated intensity matching. Motion
parameters $A_k$ are estimated over the support of model k (see step 3 below).

In either case, assuming that the variances for all classes are the same, maximiza- 
tion of the log-likelihood function is equivalent to minimization of the cost function 


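$$
\sum_{\mathbf{x}} \left\| o(\mathbf{x}) - O(A_{z(\mathbf{x})}; \mathbf{x}) \right\|^2 \tag{5.45}
$$


or equivalently

$$
\sum_{k=1}^{K} \sum_{\mathbf{x} \in Z_k} \left\| o(\mathbf{x}) - O_k(\mathbf{x}) \right\|^2
$$


where $Z_k$ is the set of pixels x with motion label z(x) = k and $O_k(\mathbf{x}) = O(A_k; \mathbf{x})$.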
An iterative solution to this problem is given by

1. Initialize $A_1, A_2, \ldots, A_K$.
2. Assign a motion label z(x) to each pixel x as

$$
z(\mathbf{x}) = \arg\min_{k} \left\| o(\mathbf{x}) - O(A_k; \mathbf{x}) \right\|^2 \tag{5.46}
$$

where k takes values from the set {1, 2, ..., K}.
3. Update $A_1, A_2, \ldots, A_K$ as

$$
A_k = \arg\min_{A} \sum_{\mathbf{x} \in Z_k} \left\| \mathbf{v}(\mathbf{x}) - P(A; \mathbf{x}) \right\|^2 \tag{5.47}
$$

This minimization is equivalent to least-squares estimation of the affine-motion
model fit to those motion vectors with the label z(x) = k. A closed-form solution
to this problem can be expressed in terms of a linear matrix equation

$$
\begin{bmatrix} x_1 & x_2 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x_1 & x_2 & 1 \end{bmatrix}
\begin{bmatrix} a_{k,1} \\ a_{k,2} \\ \vdots \\ a_{k,6} \end{bmatrix}
= \begin{bmatrix} v_1(\mathbf{x}) \\ v_2(\mathbf{x}) \end{bmatrix} \tag{5.48}
$$

for all x such that z(x) = k.
4. Repeat steps 2 and 3 until the class means $A_k$ do not change by more than a
predefined amount between successive iterations.


This method does not require gradient-based optimization or other numeric 
search procedures for optimization of a cost function. Thus, it is robust and compu- 
tationally efficient. Extensions using mixture modeling and robust estimators that 
require gradient-based optimization have also been proposed [Aye 95]. 
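
The alternation between label assignment and least-squares parameter update can be sketched for Case 1 (pre-computed flow, affine model). This is a minimal sketch under several assumptions: the flow is a dictionary from pixel coordinates to vectors, labels are 0-based, the two flow components are fitted separately through the 3 × 3 normal equations (solved by Cramer's rule), and a model whose support is too small or degenerate simply keeps its previous parameters. All function names are hypothetical.

```python
def solve3(m, b):
    """Solve a 3x3 linear system by Cramer's rule; return None if singular."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    if abs(d) < 1e-12:
        return None
    return [det([[b[i] if c == j else m[i][c] for c in range(3)] for i in range(3)]) / d
            for j in range(3)]


def fit_affine(pixels, flow):
    """Least-squares affine parameters [a1..a6] for the flow at `pixels`."""
    rows = [(x1, x2, 1.0) for x1, x2 in pixels]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for comp in (0, 1):  # horizontal, then vertical flow component
        atb = [sum(r[i] * flow[p][comp] for r, p in zip(rows, pixels)) for i in range(3)]
        sol = solve3(ata, atb)
        if sol is None:
            return None
        params.extend(sol)
    return params


def predict(a, x1, x2):
    return (a[0] * x1 + a[1] * x2 + a[2], a[3] * x1 + a[4] * x2 + a[5])


def ml_segment(flow, models, max_iter=20, tol=1e-9):
    labels = {}
    for _ in range(max_iter):
        # Step 2: give each pixel the label of the best-matching model.
        for (x1, x2), (v1, v2) in flow.items():
            errs = [(v1 - p1) ** 2 + (v2 - p2) ** 2
                    for p1, p2 in (predict(a, x1, x2) for a in models)]
            labels[(x1, x2)] = min(range(len(models)), key=errs.__getitem__)
        # Step 3: refit each model over its current support.
        new_models = []
        for k, a in enumerate(models):
            support = [p for p, lab in labels.items() if lab == k]
            fitted = fit_affine(support, flow) if len(support) >= 3 else None
            new_models.append(fitted if fitted is not None else list(a))
        # Step 4: stop when the parameters no longer change.
        change = max(abs(n - o) for new, old in zip(new_models, models)
                     for n, o in zip(new, old))
        models = new_models
        if change <= tol:
            break
    return models, labels


# Synthetic flow: left half moves right by 1, right half moves up by 1.
flow = {(x1, x2): ((1.0, 0.0) if x1 < 3 else (0.0, -1.0))
        for x1 in range(6) for x2 in range(4)}
initial = [[0, 0, 0.8, 0, 0, 0], [0, 0, 0, 0, 0, -0.7]]
models, labels = ml_segment(flow, initial)
print([round(v, 3) for v in models[0]], [round(v, 3) for v in models[1]])
```

Each pass uses only closed-form steps, which is the property the text credits for the method's efficiency; the result still depends on the initial models supplied in step 1.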

Motion-vector matching is a good segmentation criterion when the estimated 
motion field is accurate; i.e., if all outlier motion estimates are eliminated. Motion- 
compensated intensity matching is a more suitable criterion when spatial-intensity 
(color) variations are sufficient and/or a multi-resolution labeling procedure is 
employed. 

A possible limitation of the ML segmentation framework is that it lacks con- 
straints to enforce spatial and temporal continuity of the segmentation labels. Thus, 
ad-hoc procedures are needed to eliminate small, isolated regions in the segmenta- 
tion label field. The MAP segmentation strategy promises to impose continuity con- 
straints in an optimization framework. 


Maximum A Posteriori Probability Segmentation 


The MAP method is a Bayesian approach that searches for the maximum of the a 
posteriori pdf of the segmentation labels given the observations (either estimated 
optical flow or observed intensity data). The a posteriori pdf is not only a measure 
of how well the segmentation labels explain the observed data, but also how well 
they conform to our prior expectations. The MAP formulation differs from the ML 
approach in that it includes smoothness terms to enforce spatial continuity of the 
output-motion segmentation map. 

The a posteriori pdf p(z|o) of the segmentation label field z given the observed
data o can be expressed, using the Bayes theorem, as

$$
p(\mathbf{z} \mid \mathbf{o}) = \frac{p(\mathbf{o} \mid \mathbf{z})\, p(\mathbf{z})}{p(\mathbf{o})} \tag{5.49}
$$


where p(o|z) is the conditional pdf of optical-flow vectors given the segmentation z 
and p(z) is the a priori pdf of the segmentation labels. Notice that: i) z is a discrete-valued
random vector with a finite sample space $\Omega$, and ii) the pdf p(o) is constant
with respect to segmentation labels and, hence, can be ignored for the purpose of
computing z. The MAP estimate, then, maximizes the numerator of (5.49) over all
possible realizations of the segmentation label field $\mathbf{z} = \omega$, where $\omega \in \Omega$.

Modeling of the conditional pdf p(o|z) through (5.40) and (5.42) or (5.44) has 
been discussed while presenting the ML method. The prior pdf is modeled by a 
Gibbs distribution, which effectively introduces local smoothness constraints on the 
segmentation. The form of the prior pdf is given by Eqn. (5.8) in Section 5.1.3. Prior 
constraints on the structure of the segmentation labels, such as spatial smoothness, 
can be specified in terms of the clique potentials (defined in Appendix B). Temporal
continuity of the labels can similarly be modeled [Mur 87]. 

Substituting (5.40) and (5.8) into (5.49) and taking the logarithm of the result- 
ing expression, maximization of the a posteriori pdf can be performed by minimizing 
the cost function 

$$
\frac{1}{2\sigma^2} \sum_{i=1}^{M} \varepsilon^2(\mathbf{x}_i) + U(\mathbf{z}) \tag{5.50}
$$


The first term describes how well the predicted data fit the actual measurements 
(estimated optical-flow vectors or observed image-intensity values) and the second 
term measures how well the segmentation conforms to our prior expectations. 
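
The step from (5.49) to (5.50) can be written out explicitly. The short derivation below assumes a Gibbs prior of the form $p(\mathbf{z}) = \frac{1}{Q} \exp\{-U(\mathbf{z})\}$, with any temperature constant absorbed into $U$, together with the Gaussian likelihood (5.40); the symbol $Q$ for the normalizing constant is introduced here only for this illustration.

```latex
\begin{align*}
-\log p(\mathbf{z} \mid \mathbf{o})
  &= -\log p(\mathbf{o} \mid \mathbf{z}) - \log p(\mathbf{z}) + \log p(\mathbf{o}) \\
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{M} \varepsilon^2(\mathbf{x}_i) + U(\mathbf{z})
     + \underbrace{\frac{M}{2}\log(2\pi\sigma^2) + \log Q + \log p(\mathbf{o})}_{\text{independent of } \mathbf{z}}
\end{align*}
```

Dropping the terms that do not depend on the labels leaves the two-term cost function, so maximizing the a posteriori pdf and minimizing (5.50) select the same labeling.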

Because the motion-model parameters corresponding to each label are not 
known a priori, MAP segmentation must alternate between estimation of model 
parameters and assignment of the segmentation labels to optimize the cost function 
(5.50). Murray and Buxton [Mur 87] were the first to propose a MAP segmentation 
method, where the optical flow was modeled by a piecewise quadratic-flow field 
(5.41) and the segmentation labels were assigned based on a simulated annealing 
(SA) procedure (see Appendix C). Given the estimated flow field v and the number
of independent-motion models K, MAP segmentation using the Metropolis algo- 
rithm can be summarized as follows: 


1. Start with an initial labeling z of the optical-flow vectors. Calculate the model
parameters $A_1, A_2, \ldots, A_K$ for each region using least-squares fitting (similar to
Eqn. (5.48)). Set the initial temperature for SA.

2. Update the segmentation labels at each site $\mathbf{x}_i$ as follows:


a. Perturb the label $z_i = z(\mathbf{x}_i)$ randomly.
b. Decide whether to accept or reject this perturbation, based on the change
$\Delta E$ in the cost function (5.50),

$$
\Delta E = \frac{1}{2\sigma^2} \Delta\varepsilon^2(\mathbf{x}_i) + \sum_{\mathbf{x}_j \in N_{\mathbf{x}_i}} \Delta V_C\big(z(\mathbf{x}_i), z(\mathbf{x}_j)\big) \tag{5.51}
$$

where $N_{\mathbf{x}_i}$ denotes a neighborhood of site $\mathbf{x}_i$ and $V_C(z(\mathbf{x}_i), z(\mathbf{x}_j))$ is given
by Eqn. (5.9). The first term indicates whether or not the perturbed label is
more consistent with the given flow field determined by the residual (5.42),
and the second term reflects whether or not it is in agreement with the prior
segmentation field model.

3. After all pixel sites are visited once, re-estimate the mapping parameters for each 
region based on the new segmentation-label configuration. Note that the order 
in which the sites are visited affects the result because the update at each site is 
dependent on the labels of neighboring sites. 

4. Exit if a stopping criterion is satisfied. Otherwise, lower the temperature according
to a predefined temperature schedule, and go to step 2.
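
The accept/reject decision in step 2b can be sketched as follows. Two assumptions are made loudly here: the pairwise clique potential is taken to be a simple Potts-type term (minus `gamma` when two neighboring labels agree, plus `gamma` otherwise), standing in for Eqn. (5.9), and the acceptance test is the standard Metropolis rule, which accepts every non-increasing move and accepts an increase with probability $\exp(-\Delta E / T)$. Function and parameter names are hypothetical.

```python
import math
import random


def potts_energy(label, neighbor_labels, gamma):
    """Assumed clique potential: -gamma per agreeing neighbor, +gamma otherwise."""
    return sum(-gamma if label == n else gamma for n in neighbor_labels)


def delta_energy(res_old, res_new, sigma2, old_label, new_label, neighbor_labels, gamma):
    """Change in cost when one site switches label (data term plus prior term)."""
    data = (res_new - res_old) / (2.0 * sigma2)
    prior = (potts_energy(new_label, neighbor_labels, gamma)
             - potts_energy(old_label, neighbor_labels, gamma))
    return data + prior


def metropolis_accept(delta_e, temperature, rng=random):
    """Standard Metropolis test at the given temperature."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


# Squared flow residual drops from 4.0 to 0.25 and three of four neighbors
# already carry the proposed label, so the move lowers the cost.
d_e = delta_energy(res_old=4.0, res_new=0.25, sigma2=1.0,
                   old_label=0, new_label=1,
                   neighbor_labels=[1, 1, 0, 1], gamma=0.5)
print(d_e, metropolis_accept(d_e, temperature=1.0))
```

As the temperature is lowered, cost-increasing moves are accepted less and less often, so late iterations behave almost like a greedy descent.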


We can make the following observations: i) The MAP method carries a high computational
cost. ii) The procedure proposed by Murray and Buxton suggests performing
the model parameter update (step 3 above) after each and every perturbation.
We did not notice a significant difference in performance if the motion-parameter
updates were done after all sites are visited once. iii) The method can be applied with
any parametric-motion model, although the original formulation has been developed
on the basis of the eight-parameter model.

In addition to its high computational cost, the pixel-based MAP method cannot 
guarantee that the estimated motion boundaries coincide with spatial color edges 
(object boundaries). We next present an alternative ML region labeling approach to 
address this problem. 


5.3.3 Region-Based Motion Segmentation: 
Fusion of Color and Motion 


Fusion of contrast/color and motion can yield more robust video segmentation. This 
section extends the pixel-based ML method to region-based ML motion segmentation, 
where the image is first segmented into homogeneous color regions, and each color 
region is assigned a single motion label. It is generally true that motion boundaries 
coincide with color-segment boundaries, but not vice versa; i.e., color segments are
almost always a subset of motion segments as illustrated in Figure 5.2. Therefore, one
can first perform color segmentation to obtain a set of candidate motion segments, such
that each region has a single motion. Other approaches for region definition include
superpixels ($N \times N$ blocks) and mesh-based partitioning of frames. The region-based
motion label assignment strategy facilitates obtaining spatially continuous segmenta- 
tion maps that are more closely related to actual object boundaries, without the heavy 
computational burden of the pixel-based Markov random field (MRF) model approach. 

We assume that a region-formation procedure (e.g., color segmentation using the fuzzy
C-means algorithm [Lim 90]) has been performed on each video frame. We let C(x)
denote the region map of a frame consisting of M mutually exclusive and exhaustive
regions and define $C_m$ as the set of pixels x with the region label C(x) = m, m = 1, ..., M.

We seek the motion-segmentation vector z (formed by lexicographic ordering of
z(x)) and the affine parameter vectors $A_1, A_2, \ldots, A_K$ that best fit the dense-motion
field, such that

$$
\sum_{m=1}^{M} \sum_{\mathbf{x} \in C_m} \left\| \mathbf{v}(\mathbf{x}) - P(A_{z(m)}; \mathbf{x}) \right\|^2 \tag{5.52}
$$

is minimized. Here, P is the operator defined by Eqn. (5.37), z(m) refers to the motion
label of all pixels within $C_m$ and takes one of the values 1, 2, ..., K, and v(x) is the
dense-motion vector at pixel x as defined by Eqn. (5.38). The procedure is given by [Alt 98]:


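1. Initialize the motion-segmentation map z by assigning a single motion label k,
$k = 1, \ldots, K$, to each $C_m$.
2. Update the parameter vectors $A_1, A_2, \ldots, A_K$ as

$$
A_k = \arg\min_{A} \sum_{\mathbf{x} \in Z_k} \left\| \mathbf{v}(\mathbf{x}) - P(A; \mathbf{x}) \right\|^2 \tag{5.53}
$$

where $Z_k$ is the set of pixels x with the label z(x) = k. This minimization can be
achieved by solving the linear matrix equation

$$
\begin{bmatrix} x_1 & x_2 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x_1 & x_2 & 1 \end{bmatrix}
\begin{bmatrix} a_{k,1} \\ a_{k,2} \\ \vdots \\ a_{k,6} \end{bmatrix}
= \begin{bmatrix} v_1(\mathbf{x}) \\ v_2(\mathbf{x}) \end{bmatrix} \tag{5.54}
$$

for all $\mathbf{x} \in Z_k$.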

Figure 5.2 Video segmentation: (a) a frame of “Mother and Daughter” sequence; (b) color-only
pixel segmentation; (c) motion-vector field between two frames; (d) pixel-based motion
segmentation; and (e) color-region-based motion segmentation.


3. Assign a motion label to each region $C_m$, $m = 1, \ldots, M$, such that

$$
z(C_m) = \arg\min_{k} \sum_{\mathbf{x} \in C_m} \left\| o(\mathbf{x}) - O(A_k; \mathbf{x}) \right\|^2 \tag{5.55}
$$

where k = 1, 2, ..., K, and o(x) and $O(A_k; \mathbf{x})$ are as defined in Section 5.3.2.
This allows region-based affine motion segmentation with pixel-based motion-vector
or intensity matching.

4. Repeat steps 2 and 3 until the class means $A_k$ do not change by more than a
pre-defined amount between successive iterations.


It can be clearly seen from Figure 5.2(e) that region-based label assignment 
results in a better segmentation of the head of the woman compared to pixel-based 
segmentation in Figure 5.2(d). We note that the pixel-based ML-motion segmen- 
tation presented in Section 5.3.2 is a special case of this region-based framework, 
where each region $C_m$ contains a single pixel.
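
The region-level label assignment of Eqn. (5.55) differs from the pixel-level rule only in that matching errors are summed over a whole color region before the minimum is taken. The sketch below illustrates this with motion-vector matching and affine models; the data layout, the region names, and the function name are assumptions of the example, and labels are 0-based.

```python
def assign_region_labels(regions, flow, models):
    """regions: dict region_id -> list of pixels. Returns dict region_id -> label."""
    labels = {}
    for region_id, pixels in regions.items():
        costs = []
        for a in models:
            cost = 0.0
            for (x1, x2) in pixels:
                v1, v2 = flow[(x1, x2)]
                p1 = a[0] * x1 + a[1] * x2 + a[2]
                p2 = a[3] * x1 + a[4] * x2 + a[5]
                cost += (v1 - p1) ** 2 + (v2 - p2) ** 2
            costs.append(cost)
        labels[region_id] = min(range(len(models)), key=costs.__getitem__)
    return labels


models = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],  # moving right by one pixel
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # static
]
regions = {"A": [(0, 0), (1, 0), (2, 0)], "B": [(0, 5), (1, 5)]}
flow = {
    (0, 0): (1.0, 0.0), (1, 0): (0.9, 0.0), (2, 0): (0.1, 0.0),  # last one is an outlier
    (0, 5): (0.0, 0.0), (1, 5): (0.1, 0.0),
}
region_labels = assign_region_labels(regions, flow, models)
print(region_labels)
```

The outlier vector at pixel (2, 0) would be labeled static by a pixel-wise rule, but it is outvoted by the rest of region A, so the whole region keeps a single, spatially coherent label.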


5.3.4 Simultaneous Motion Estimation and Segmentation 


Until now, we have discussed methods to compute segmentation labels from either 
pre-computed optical flow or directly from intensity values, but have not addressed 
how to compute an improved dense-motion field along with the segmentation map.
It is clear that the success of optical-flow segmentation is closely related to the accuracy
of the estimated optical-flow field (in the case of using pre-computed flow),
and vice versa. It follows that optical-flow estimation and segmentation should be
addressed simultaneously in a mutually beneficial manner for best results. Here, we
present a simultaneous Bayesian approach based on a representation of the motion
field as the sum of a parametric field and a residual field. The interdependence of
optical-flow and segmentation fields is expressed in terms of a Gibbs distribution within the
MAP framework. The resulting optimization problem, to find estimates of a dense
set of motion vectors, a set of segmentation labels, and a set of mapping parameters,
is solved using the highest confidence first (HCF) and iterated conditional mode
(ICM) algorithms.


Motion-Field Model and MAP Framework 


We model the optical-flow field v(x) as the sum of a parametric flow $\tilde{\mathbf{v}}(\mathbf{x})$ and a
non-parametric residual field $\mathbf{v}_r(\mathbf{x})$ that accounts for local motion and other modeling
errors, i.e.,

$$
\mathbf{v}(\mathbf{x}) = \tilde{\mathbf{v}}(\mathbf{x}) + \mathbf{v}_r(\mathbf{x}) \tag{5.56}
$$

The parametric component of the flow $\tilde{\mathbf{v}}(\mathbf{x})$ is calculated from the model parameters
$A_i$, $i = 1, \ldots, K$, which in turn are a function of v(x) and z(x). The simultaneous
MAP framework aims at maximizing the a posteriori pdf


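$$
p(\mathbf{v}_1, \mathbf{v}_2, \mathbf{z} \mid \mathbf{g}_k, \mathbf{g}_{k+1}) =
\frac{p(\mathbf{g}_{k+1} \mid \mathbf{g}_k, \mathbf{v}_1, \mathbf{v}_2, \mathbf{z})\; p(\mathbf{v}_1, \mathbf{v}_2 \mid \mathbf{z}, \mathbf{g}_k)\; p(\mathbf{z} \mid \mathbf{g}_k)}{p(\mathbf{g}_{k+1} \mid \mathbf{g}_k)} \tag{5.57}
$$

with respect to the optical-flow vectors $\mathbf{v}_1$, $\mathbf{v}_2$ and the segmentation labels z, where
$\mathbf{v}_1$ and $\mathbf{v}_2$ denote the lexicographic ordering of the first and second components of
the flow vectors $\mathbf{v}(\mathbf{x}) = [v_1(\mathbf{x}) \;\; v_2(\mathbf{x})]^T$ at each pixel x. Through careful modeling of
these pdfs, we can express an interrelated set of constraints that help improve both
optical-flow and segmentation estimates.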

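The first term $p(\mathbf{g}_{k+1} \mid \mathbf{g}_k, \mathbf{v}_1, \mathbf{v}_2, \mathbf{z})$ in the numerator of (5.57) provides a measure
of how well the present displacement and segmentation estimates conform to the
observed frame k+1 given frame k. It is modeled by a Gibbs distribution as

$$
p(\mathbf{g}_{k+1} \mid \mathbf{g}_k, \mathbf{v}_1, \mathbf{v}_2, \mathbf{z}) = \frac{1}{Q_1} \exp\left\{ -U_1(\mathbf{g}_{k+1} \mid \mathbf{g}_k, \mathbf{v}_1, \mathbf{v}_2, \mathbf{z}) \right\} \tag{5.58}
$$

where $Q_1$ is the partition function (normalizing constant) and

$$
U_1(\mathbf{g}_{k+1} \mid \mathbf{g}_k, \mathbf{v}_1, \mathbf{v}_2, \mathbf{z}) = \sum_{\mathbf{x}} \left[ g_k(\mathbf{x}) - g_{k+1}(\mathbf{x} + \mathbf{v}(\mathbf{x})\Delta t) \right]^2
$$

is called the Gibbs potential, which corresponds to the norm-square of the displaced
frame difference (DFD) between the frames $g_k(\mathbf{x})$ and $g_{k+1}(\mathbf{x})$. Thus, maximization of
(5.58) imposes that v(x) minimizes the DFD.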

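The second term is the conditional pdf of the displacement field given the motion
segmentation and the search frame k. It is also modeled by a Gibbs distribution

$$
p(\mathbf{v}_1, \mathbf{v}_2 \mid \mathbf{z}, \mathbf{g}_k) = p(\mathbf{v}_1, \mathbf{v}_2 \mid \mathbf{z}) = \frac{1}{Q_2} \exp\left\{ -U_2(\mathbf{v}_1, \mathbf{v}_2 \mid \mathbf{z}) \right\} \tag{5.59}
$$

where $Q_2$ is a constant and

$$
U_2(\mathbf{v}_1, \mathbf{v}_2 \mid \mathbf{z}) = \alpha \sum_{\mathbf{x}} \left\| \mathbf{v}(\mathbf{x}) - \tilde{\mathbf{v}}(\mathbf{x}) \right\|^2
+ \beta \sum_{\mathbf{x}_i} \sum_{\mathbf{x}_j \in N_{\mathbf{x}_i}} \left\| \mathbf{v}(\mathbf{x}_i) - \mathbf{v}(\mathbf{x}_j) \right\|^2 \delta\big(z(\mathbf{x}_i) - z(\mathbf{x}_j)\big) \tag{5.60}
$$

is the corresponding Gibbs potential, $\|\cdot\|$ denotes the Euclidean distance, and $N_{\mathbf{x}}$
is the set of neighbors of site x. The first term in (5.60) enforces a minimum-norm
estimate of the residual-motion field $\mathbf{v}_r(\mathbf{x})$; i.e., it aims to minimize the deviation of
the optical-flow estimates v(x) from the parametric-motion field $\tilde{\mathbf{v}}(\mathbf{x})$ while minimizing
the DFD. The second term in (5.60) imposes a piecewise local smoothness
constraint on the optical-flow estimates for those sites in the neighborhood $N_{\mathbf{x}}$ that
have the same segmentation label as site x. Thus, spatial smoothness is enforced
only on the flow vectors within a single region. The parameters $\alpha$ and $\beta$ allow for
relative scaling of the two terms.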
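
To make the two terms of the potential (5.60) concrete, the following sketch evaluates it on a small label map. It assumes a 4-connected neighborhood and a dictionary-based data layout, both chosen only for the example; the double sum visits every ordered pair of neighbors, so each unordered pair contributes twice, as in the formula.

```python
def gibbs_potential_u2(flow, parametric, labels, alpha, beta):
    """flow, parametric: dict pixel -> (v1, v2); labels: dict pixel -> label."""
    def sq(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # First term: squared norm of the residual field v(x) - v~(x).
    residual = sum(sq(flow[p], parametric[p]) for p in flow)
    # Second term: smoothness between neighbors that share a label.
    smooth = 0.0
    for (x1, x2) in flow:
        for q in ((x1 + 1, x2), (x1 - 1, x2), (x1, x2 + 1), (x1, x2 - 1)):
            if q in flow and labels[q] == labels[(x1, x2)]:
                smooth += sq(flow[(x1, x2)], flow[q])
    return alpha * residual + beta * smooth


# Three pixels in a row; the last one belongs to a different motion segment.
flow = {(0, 0): (1.0, 0.0), (1, 0): (1.5, 0.0), (2, 0): (-2.0, 0.0)}
parametric = {(0, 0): (1.0, 0.0), (1, 0): (1.0, 0.0), (2, 0): (-2.0, 0.0)}
labels = {(0, 0): 0, (1, 0): 0, (2, 0): 1}
u2 = gibbs_potential_u2(flow, parametric, labels, alpha=1.0, beta=2.0)
u2_one_region = gibbs_potential_u2(flow, parametric, {p: 0 for p in flow}, 1.0, 2.0)
print(u2, u2_one_region)
```

With the correct labels the large flow discontinuity between the second and third pixels is not penalized; forcing all pixels into one region makes the smoothness term dominate, which shows how the delta factor restricts smoothing to the interior of each segment.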

The third term in (5.57) models the a priori probability of the segmentation field
in a manner similar to that explained in Section 5.1.3. It is given by

$$
p(\mathbf{z} \mid \mathbf{g}_k) = p(\mathbf{z}) = \frac{1}{Q_3} \sum_{\omega \in \Omega} \exp\left\{ -U_3(\mathbf{z}) \right\} \delta(\mathbf{z} - \omega) \tag{5.61}
$$

where $\Omega$ denotes the sample space of the discrete-valued random vector z, and $Q_3$
and $U_3(\mathbf{z})$ are as defined in Eqn. (5.8). The dependence of labels on image intensity
is usually neglected, although region boundaries generally coincide with intensity
edges.


Two-Step Iteration Algorithm


Maximizing the a posteriori pdf (5.57) is equivalent to minimizing the cost function

$$
E = U_1(\mathbf{g}_{k+1} \mid \mathbf{g}_k, \mathbf{v}_1, \mathbf{v}_2, \mathbf{z}) + U_2(\mathbf{v}_1, \mathbf{v}_2 \mid \mathbf{z}) + U_3(\mathbf{z}) \tag{5.62}
$$

that is composed of the potential functions in Eqs. (5.58), (5.59), and (5.61).
Direct minimization of (5.62) with respect to all unknowns is too difficult
because motion and segmentation fields contain a large number of unknowns.
Instead, we minimize (5.62) through the following two-step iteration [Cha 97]:


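1. Given the best available estimates of the motion parameters $A_i$, $i = 1, \ldots, K$, and
z, update the optical-flow field v(x). This step involves the minimization of a
modified cost function

$$
\begin{aligned}
E_v(\mathbf{v}(\mathbf{x})) ={}& \sum_{\mathbf{x}} \left[ g_k(\mathbf{x}) - g_{k+1}(\mathbf{x} + \mathbf{v}(\mathbf{x})\Delta t) \right]^2
+ \alpha \sum_{\mathbf{x}} \left\| \mathbf{v}(\mathbf{x}) - \tilde{\mathbf{v}}(\mathbf{x}) \right\|^2 \\
& + \beta \sum_{\mathbf{x}_i} \sum_{\mathbf{x}_j \in N_{\mathbf{x}_i}} \left\| \mathbf{v}(\mathbf{x}_i) - \mathbf{v}(\mathbf{x}_j) \right\|^2 \delta\big(z(\mathbf{x}_i) - z(\mathbf{x}_j)\big)
\end{aligned} \tag{5.63}
$$

which is composed of all terms in (5.62) that contain v(x). While the first term
indicates how well the motion vectors v(x) explain observations, the second and
third terms impose that they should conform to the parametric-flow model,
and vary smoothly within each region, respectively. To minimize this energy
function, we employ the HCF method proposed by Chou and Brown [Cho 90].
HCF is a deterministic method designed to efficiently handle optimization
of multi-variable problems with neighborhood interactions.

2. Update the segmentation label field z, assuming that the optical-flow field v(x)
is known. This step minimizes all terms in (5.62) that contain z as well as $\tilde{\mathbf{v}}(\mathbf{x})$,
given by

$$
E_z(\mathbf{z}) = \alpha \sum_{\mathbf{x}} \left\| \mathbf{v}(\mathbf{x}) - \tilde{\mathbf{v}}(\mathbf{x}) \right\|^2
+ \sum_{\mathbf{x}_i} \sum_{\mathbf{x}_j \in N_{\mathbf{x}_i}} V_C\big(z(\mathbf{x}_i), z(\mathbf{x}_j)\big) \tag{5.64}
$$

The first term in (5.64) quantifies the consistency of $\tilde{\mathbf{v}}(\mathbf{x})$ and v(x). The second
term is the a priori probability of the present configuration of segmentation
labels. We use the ICM procedure to optimize $E_z$ with respect to z [Cha 97].
The mapping parameters $A_i$, $i = 1, \ldots, K$, are updated by least-squares estimation
within each region.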

An initial estimate of the optical-flow field can be found by Bayesian estimation
using a global smoothness constraint. Given this estimate, the segmentation
labels can be initialized by a procedure similar to Wang and Adelson’s [Wan 94]. The
free parameters $\alpha$, $\beta$, and $\gamma$ may be chosen so that each term in the cost function
(5.62) has equal emphasis. However, because the optimization is implemented in
two steps, the ratio $\alpha/\gamma$ also becomes of consequence. It is recommended to select
$1 \le \alpha/\gamma \le 5$, depending on how well the motion field can be represented by a
piecewise-parametric model and whether we have a sufficient number of classes to
model the segmentation labels.

A hierarchical implementation, where $\mathbf{v}_1$, $\mathbf{v}_2$, and z can be estimated at different
resolutions, is possible by constructing Gaussian pyramids of the images $g_k$ and $g_{k+1}$.
The results of each hierarchy level are used to initialize the next level. Note that the
Gibbs model for the segmentation labels has been extended to include neighbors in
scale by Kato et al. [Kat 93].

Several other motion-analysis approaches can be formulated as special cases of 
this framework. If we retain only the first and third terms in (5.62), and assume that 
all sites possess the same segmentation label, then we have Bayesian-motion esti- 
mation with a global smoothness constraint. The motion-segmentation algorithm 
of Murray and Buxton [Mur 87] (Section 5.3.2) employs only the second term in 
(5.60) and third term in (5.62) to model the conditional and prior pdf, respectively. 
Wang and Adelson [Wan 94] rely on the first term in (5.60) to compute the motion 
segmentation (Section 5.3.2). However, they also take the DFD of the parametric 
motion vectors into consideration when the closest match between the estimated 
and parametric-motion vectors, represented by the second term, exceeds a threshold. 


5.4 Motion Tracking 


There are various technologies for motion tracking, including inertial sensing (e.g.,
accelerometers and gyroscopes), radio sensing (e.g., radio frequency identification 
(RFID) and global positioning system (GPS) tracking), vision-based methods, and 
hybrid methods that combine multiple sensors. In the following, we only discuss 
vision-based or visual tracking methods. 

Visual-motion tracking computes temporally linked feature points (tracks or 
trajectories) or spatio-temporal segmentation maps (tubes) for one or more target 
objects in consecutive video frames by associating their appearance. Hence, track- 
ing can be considered as causal spatio-temporal video segmentation, where color/ 
motion-segmentation methods are extended to determine the map (template) of
an object in the current frame, given its map in the previous frame. The temporal
association of templates can be difficult when the tracked object gets occluded, or its
shape or 3D orientation changes from frame to frame, or when it is moving too fast.
Motion tracking may sometimes refer to estimating 3D-motion trajectory of a mov- 
ing camera capturing a static scene, which is an important problem in robotic vision 
for self-navigation. 

Various motion-tracking methods differ in how they model the appearance 
(color and/or shape) of moving objects and dynamics of motion. Some methods 
model objects by a cloud of points, some by fixed or adaptive templates, and others 
just model their contours. Hence, visual motion-tracking methods can be broadly 
classified as i) feature-point trackers, ii) template trackers, and iii) contour trackers. 
The temporal dynamics can be modeled in a prediction-update framework (e.g., par- 
ticle filtering) or energy-minimization framework (e.g., active-contour modeling). 

Feature-tracking methods can be classified as marker and markerless tracking 
methods. A set of fiducial markers with known color and size are often used in 
computer-vision applications, such as motion capture and augmented reality, to 
track the 3D pose of a marker coordinate system with respect to the camera coor- 
dinate system. Images of markers observed by the camera can be matched with the
original known marker patterns. The pose of the marker with respect to the camera 
can be recovered using standard pose-estimation techniques. In the absence of mark- 
ers, a set of image feature points, such as corner points, which can be automatically 
detected by means of image analysis, can be used for tracking. 

Template-tracking methods employ a bounding box or an arbitrary-shaped tem- 
plate that can be tracked from frame to frame. The general idea is to project the 
current template into the next frame using color only (mean-shift or graph-based) or 
2D-motion and color cues (KLT tracking or particle filtering discussed in Sections 
5.4.2 and 5.4.4, respectively). The projected template can then be fine-tuned by 
morphological or other operators using color and edge information to obtain a more 
precise segmentation map to alleviate motion-estimation errors as well as include 
newly uncovered regions. Many algorithms employ fixed target/object appearance 
models, which are determined or trained before tracking begins and, hence, ignore 
changes in object texture and shape due to pose variations or lighting conditions 
during tracking. Some researchers address the template update problem or incre- 
mental learning of a low-dimensional object representation for more robust tracking 
in cluttered environments [Mat 04, Ros 08]. 

Active-contour trackers model and track connected contour segments by energy- 
minimization methods. We discuss specific tracking methods next. 


5.4.1 Graph-Based Spatio-Temporal Segmentation and Tracking


Graph-based video segmentation is a direct extension of the graph-based image- 
segmentation methods discussed in Section 5.1.4. They form super-voxels, which 
are homogeneous space-time regions [Xu 12]. Grundmann et al. [Gru 10] gener-
alize Felzenszwalb-Huttenlocher [Fel 04] graph-based segmentation to obtain an
initial over-segmentation of a video volume into super-voxels by building a 3D 
graph. They use a tree structure to represent the segmentation hierarchy. They 
obtain regions that exhibit long-term temporal coherence by combining a volu- 
metric over-segmentation with a hierarchical re-segmentation applying the same 
algorithm and using optical flow as a region descriptor for graph nodes. However, 
direct extensions of graph-based image segmentation methods are not causal, since 
they require access to the entire video sequence. To this effect, Grundmann et al. 
[Gru 10] proposed a clip-based processing approach to limit processing delay and 
memory usage. 

A causal graph-based video segmentation has been proposed in [Cou 13], which 
first segments frame k into super-pixels using the Felzenszwalb-Huttenlocher [Fel
04] method. Then, a graph is formed linking regions in frame k-1 (computed at the
previous step) to super-pixels in frame k. Edge weights for links depend on the num-
ber of pixels in a region (frame k-1) and linked super-pixel (frame k), the difference
of their mean color, and the distance between their centroids. The final segmentation
for frame k is computed by a minimum spanning forest procedure.

Graph-based video volume-segmentation methods are fully automatic and, 
unlike some tracking methods, do not require initialization or initial target detec- 
tion. Hence, they may be used as pre-processing before some tracking applications 
that do not have strict real-time requirements. 


5.4.2 Kanade-Lucas-Tomasi Tracking


In Chapter 4, we discussed Lucas and Kanade motion estimation [Luc 81], which is 
an iterative method to compute incremental displacements to register a template and 
an image. Later, Tomasi and Kanade [Tom 91] published a technical report, in which 
this method is used for tracking “good features” that satisfy certain criteria. In a later 
paper, Shi and Tomasi [Shi 94] proposed an additional step to verify that features are 
tracked correctly. Trackers that follow a methodology based on these three papers
and their extensions are called Kanade-Lucas-Tomasi (KLT) trackers. KLT tracking
can be used for feature-point tracking or template-based object tracking.


Feature Tracking: Good Features to Track


KLT can be used as a feature tracker by selecting “good feature points” based on a 
small template (e.g., 15 × 15) about each feature point [Tom 91]. A feature can be
tracked reliably if a numerically stable solution to Eq. (4.32) can be found, which
requires that the matrix H is well-conditioned. This is satisfied if the smaller eigen-
value is well above the noise level. That is, if $\lambda_1$ and $\lambda_2$ are the eigenvalues of H, the
corresponding feature is a good feature to track if $\min(\lambda_1, \lambda_2) > \lambda$, where $\lambda$ is a
threshold [Tom 91, Shi 94]. The main steps of the KLT feature tracker are:


1. Detect a set of feature points in the initial frame using a feature detector, such
as Harris corner detection, where $\min(\lambda_1, \lambda_2) > \lambda$.

2. Find frame-to-frame correspondence vectors for each feature point using
Lucas-Kanade motion estimation with a translation or affine-motion
model (see Section 4.4.1) based on a local template about each feature point
[Bak 04].

3. For each feature point, verify goodness of tracking at each frame. Some features 
can be removed (to eliminate those that are occluded or cannot be tracked accu- 
rately) and new ones may be added periodically (e.g., every five frames). 
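
The selection criterion in step 1 can be illustrated with a short sketch. It assumes that H is the 2 × 2 matrix of summed gradient products over the template, which is the form this matrix takes for a translational motion model; the function names, the central-difference gradients, and the threshold value are illustrative choices, not part of the original method description.

```python
import math

def structure_matrix(img, cx, cy, half):
    """Sum gradient products over a (2*half+1)^2 window centred at (cx, cy)."""
    hxx = hxy = hyy = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            gx = (img[y][x + 1] - img[y][x - 1]) / 2.0  # central differences
            gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
            hxx += gx * gx
            hxy += gx * gy
            hyy += gy * gy
    return hxx, hxy, hyy

def min_eigenvalue(hxx, hxy, hyy):
    """Smaller eigenvalue of the symmetric matrix [[hxx, hxy], [hxy, hyy]]."""
    mean = 0.5 * (hxx + hyy)
    root = math.sqrt(0.25 * (hxx - hyy) ** 2 + hxy ** 2)
    return mean - root

def is_good_feature(img, cx, cy, half, threshold):
    """Apply the test min(lambda_1, lambda_2) > threshold (threshold is assumed)."""
    return min_eigenvalue(*structure_matrix(img, cx, cy, half)) > threshold
```

On a synthetic corner both eigenvalues are large and the test passes, whereas on a straight edge one eigenvalue is zero, so the point is rejected regardless of contrast. The window must lie at least one pixel inside the image for the gradients to be defined.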


Shi and Tomasi [Shi 94] defined a good feature as a feature that can be tracked 
well over many frames without drift. To verify this, an affine transformation is 
fit between the image of the currently tracked feature and its image from a non- 
consecutive previous frame. If the affine-compensated image is too dissimilar, the
feature is dropped. When the current and reference templates are not related by
an affine warping, the tracking residual $\epsilon_i$ between the ith template in the current
and reference frames is an outlier, i.e., is not a sample from a Gaussian distribution.
Hence, the detection of bad features reduces to a problem of outlier detection,
which is equivalent to estimating the mean and variance of a corrupted Gauss-
ian distribution. Tommasini [Tom 98] proposed a simple model-free robust rejec-
tion rule, using the median and median deviation instead of the mean and standard
deviation. This rule rejects values that are more than k median absolute
deviations (MADs) away from the median, where


$$\mathrm{MAD} = \operatorname{med}_i \left\{ \left| \epsilon_i - \operatorname{med}_j \, \epsilon_j \right| \right\}$$


A value of k = 5.2, under the hypothesis of a Gaussian distribution, is adequate in
practice, as it corresponds to about 3.5 standard deviations, and the range contains
more than 99.9% of a Gaussian distribution [Tom 98]. In order to limit the
adverse effects of slow intensity changes from frame to frame, the average gray level
in the original and warped blocks can be subtracted in the computation of $\epsilon_i$.
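
The rejection rule is simple enough to state in a few lines. The sketch below is a minimal illustration; the function name and the example residuals are hypothetical. For a Gaussian distribution the MAD is approximately 0.6745 standard deviations, so 5.2 MADs is about 5.2 × 0.6745 ≈ 3.5 standard deviations, which is where the figure quoted above comes from.

```python
from statistics import median

def mad_outliers(residuals, k=5.2):
    """Return indices of residuals more than k MADs away from the median."""
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    return [i for i, r in enumerate(residuals) if abs(r - med) > k * mad]

# Hypothetical tracking residuals: the last feature has drifted.
bad = mad_outliers([1.0, 1.2, 0.9, 1.1, 1.0, 9.0])
```

Features whose indices are returned would be dropped from the tracked set. Note that if more than half of the residuals are identical the MAD is zero, and every residual that differs from the median is flagged.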


Template Tracking


The KLT tracking framework has also been used for template tracking [Bak 04],
where an example appearance of the object, extracted in the first frame as a tem-
plate, is tracked in the remaining frames using parametric transformations (warp- 
ing) of the template. An important challenge in template tracking is handling 
variations in object appearance during tracking that may be due to intrinsic fac- 
tors, such as 3D-motion and/or shape deformation, or extrinsic causes, includ- 
ing illumination change, camera motion, and occlusions. A naive solution to this 
problem is to update the template every frame with a new image at the current
template location. The problem with this naive approach is “drift” [Mat 04, Sch 
07]. Each time the template is updated, small errors are introduced in the template 
location, which accumulate in time, and the template steadily drifts away from 
the object. A template-update algorithm that avoids drift has been proposed in 
[Mat 04], which retains multiple templates including the initial template from the 
first frame. The template is first updated with the image of the object at the cur- 
rent template location, which is then aligned with the initial template to compute 
the final updated template. An alternative sub-space projection framework, under 
“sub-space constancy assumption,” called eigentracking has also been proposed for 
robust appearance-based tracking [Bla 98]. KLT tracking does not employ a tem-
poral model of motion dynamics to enforce temporal consistency. This issue is
addressed in Section 5.4.4.


5.4.3 Mean-Shift Tracking 


While KLT requires a motion model, the mean-shift (MS) algorithm (see Section 5.1.2) can be
used for appearance-based (color only) tracking of objects (templates or blobs) where
it is difficult to specify an explicit parametric-motion model. The target object is
characterized by a color histogram (pdf) $q_l$, $l = 1, \ldots, L$, where L is the number of
bins, which is computed within an N × N bounding box. We iteratively compute the
location $\mathbf{x}$ in the current frame that maximizes the Bhattacharyya coefficient


$$\rho(\mathbf{x}) = \sum_{l=1}^{L} \sqrt{p_l(\mathbf{x})\, q_l} \tag{5.65}$$


between the histogram of the target template $\{q_l\}$ and the histogram $\{p_l(\mathbf{x})\}$ of the
bounding box centered at a location $\mathbf{x}$ in the current frame. The maximum (mode)
of $\rho(\mathbf{x})$ will be computed by using the MS algorithm, which consists of three main
steps: histogram computation, weight calculation, and finding the new location. The
complete algorithm is given by [Com 03]:


1. Select an N × N bounding box of a target (template) to be tracked in the initial
frame (frame 0). Compute the L-bin color histogram $q_l$, $l = 1, \ldots, L$, of the
template. Go to frame j = 1.

2. Set the MS iteration k = 0. Initialize the center $\mathbf{x}^{(0)}$ of the bounding box.

3. Compute the color histogram $p_l(\mathbf{x}^{(k)})$ of the current bounding box.

4. Compute weights for each pixel $i = 1, \ldots, N^2$ within the bounding box

$$w_i = \sum_{l=1}^{L} \sqrt{\frac{q_l}{p_l(\mathbf{x}^{(k)})}}\; \delta[s(\mathbf{x}_i) - l]$$

where the Kronecker delta $\delta[s(\mathbf{x}_i) - l]$ selects the histogram bin corresponding
to the color $s(\mathbf{x}_i)$ of the pixel.

5. Update the center of the bounding box to $\mathbf{x}^{(k+1)}$ using the MS iteration.

6. If $\|\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}\| < \delta$, increment the frame counter j = j + 1 and go to step 2.
Otherwise, set k = k + 1 and go to step 3 to continue MS iterations.
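
The steps above can be sketched in code for a single frame. This is a simplified illustration under several assumptions that are not taken from the text: a gray-level histogram stands in for the color histogram, the kernel weighting is flat so that the update in step 5 reduces to the weight-averaged pixel position, the new center is rounded to integer pixel coordinates, and the box is assumed to stay inside the image. All function names are hypothetical.

```python
import math

def histogram(img, cx, cy, half, bins):
    """Normalised intensity histogram of the (2*half+1)^2 box centred at (cx, cy)."""
    h = [0.0] * bins
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            h[img[y][x] * bins // 256] += 1.0
    n = float((2 * half + 1) ** 2)
    return [v / n for v in h]

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalised histograms."""
    return sum(math.sqrt(pl * ql) for pl, ql in zip(p, q))

def ms_step(img, cx, cy, half, q, bins):
    """One MS iteration: weight pixels by sqrt(q_l / p_l), move to their centroid."""
    p = histogram(img, cx, cy, half, bins)
    sw = sx = sy = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            b = img[y][x] * bins // 256
            w = math.sqrt(q[b] / p[b])  # p[b] > 0 because this pixel is in the box
            sw += w
            sx += w * x
            sy += w * y
    if sw == 0.0:                       # no pixel resembles the target: stay put
        return cx, cy
    return round(sx / sw), round(sy / sw)

def track(img, cx, cy, half, q, bins, max_iter=20):
    """Iterate ms_step until the (rounded) center stops moving."""
    for _ in range(max_iter):
        nx, ny = ms_step(img, cx, cy, half, q, bins)
        if (nx, ny) == (cx, cy):
            break
        cx, cy = nx, ny
    return cx, cy
```

In use, `q` would be computed once with `histogram` on the template in frame 0, and `track` would be called on each new frame starting from the previous center. Pixels whose bin is rare in the target but common in the current box receive small weights, which pulls the box toward the target.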


In step 5, weighting using the Epanechnikov kernel $g(x) = -k'(x)$ may be used.

In some cases, the size of the target varies from frame to frame as it may move
toward or away from the camera. Then, we need to adapt the size of the kernel
(window) to obtain the best results. It has been proposed to run the algorithm three
times, with window sizes N, nint(1.1N), and nint(0.9N), where nint denotes rounding
to the nearest integer, and then choose the best result [Com 03].


5.4.4 Particle-Filter Tracking


The particle filter is a Monte Carlo method used to compute a Bayesian state estimate
recursively by updating the a posteriori pdf of the state at each time k based on all avail-
able information up to time k. It is based on a probabilistic state-space formulation that
requires a system model describing the evolution of the state with time and an obser- 
vation model relating noisy measurements to the state at each time k. An optimal 
Bayesian estimate of the state and a measure of accuracy of the estimate may be 
obtained from the a posteriori pdf. Note that if the system and observation models 
are linear, and pdfs are Gaussian (e.g., clutter-free), the optimal recursive Bayesian 
estimate is given by the Kalman filter. When we have a non-linear system and/or 
non-Gaussian noise (e.g., cluttered background), it is often not possible to write 
closed-form expressions for the conditional and a priori pdfs (hence, the a posteriori 
pdf). Thus, particle filtering extends Kalman filtering to the case of non-linear and 
non-Gaussian models. 

The key idea of particle filtering is to represent probability distributions by a 
weighted sample set (particles) [Aru 02]. The particle filter takes a large number 
of particles to represent the underlying distribution and updates particles at each 
time k, which approaches updating the a posteriori pdf as the number of particles 
goes to infinity; hence, the particle filter approaches the optimal Bayesian esti- 
mator. In template tracking, each particle represents a particular guess for the 
location of the object (template) tracked. The set of particles with more weight 
shows locations where the object is more likely to be. This weighted distribution 
is propagated through time, and we can determine the template trajectory by tak- 
ing the particle with the highest weight or the weighted mean of the particle set 
at each time step. 

Particles are sampled randomly from the prior probability distribution $p(\mathbf{x}_0)$ of
the state vector $\mathbf{x}_0$. Each state vector typically consists of coordinates of a pixel and its
motion vector and color attributes within a local neighborhood, which can be a rect-
angle or ellipse centered at the pixel. The number of particles varies between 50 and
500. The evolution of the set of particles from frame to frame is described by propa-
gating each particle according to a dynamic system model, such as a constant-veloc-
ity or constant-acceleration model. Each particle is then weighted according to the
conditional distribution $p(\mathbf{Z}_k \mid \mathbf{x}_k)$, where $\mathbf{Z}_k = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_k\}$ denotes all observations
up to time k. The observation model measures an attribute of the object appearance,
such as the color histogram (similar to MS tracking). Last, the mean state of the
object is estimated at each time step. The re-sampling step allows replacing occluded
pixels with some newly uncovered pixels within the tracked object. The algorithm
can be summarized step-by-step as [Num 03, Bra 07]:


1. Initialization: Select a bounding shape for the object, sample N random par-
ticles $\{\mathbf{x}_0^{(i)}\}_{i=1}^{N}$ from the a priori distribution $p(\mathbf{x}_0)$ favoring pixels in/near the box
[Aru 02]. Set weights $\{W_0^{(i)} = 1/N\}_{i=1}^{N}$. For each frame k = 0, 1, 2, ...:

2. Prediction (Density Propagation): Propagate each particle $\mathbf{x}_{k+1}^{(i)} \sim p(\mathbf{x}_{k+1} \mid \mathbf{x}_k^{(i)})$,
i = 1, ..., N, given the dynamic model.

3. Update

a. Given the observation $\mathbf{z}_{k+1}$, compute the weights
$W_{k+1}^{(i)} \propto W_k^{(i)} L[\mathbf{z}_{k+1} \mid \mathbf{x}_{k+1}^{(i)}]$ for each particle $\mathbf{x}_{k+1}^{(i)}$ based on the likelihood
$L[\mathbf{z}_{k+1} \mid \mathbf{x}_{k+1}^{(i)}]$, which is a function of $p(\mathbf{z}_{k+1} \mid \mathbf{x}_{k+1}^{(i)})$.

b. Normalize the weights,

$$W_{k+1}^{(i)} = \frac{W_{k+1}^{(i)}}{\sum_{j=1}^{N} W_{k+1}^{(j)}}$$

c. The state estimate $\hat{\mathbf{x}}_{k+1}$ is the weighted average of propagated particles

$$\hat{\mathbf{x}}_{k+1} = \sum_{i=1}^{N} W_{k+1}^{(i)} \mathbf{x}_{k+1}^{(i)}$$

d. In order to suppress particles with low weight, estimate the effective num-
ber of particles

$$\hat{N}_{\mathrm{eff}} = \frac{1}{\sum_{i=1}^{N} \left( W_{k+1}^{(i)} \right)^2}$$

If $\hat{N}_{\mathrm{eff}} \le N_{\mathrm{thres}}$, then perform resampling; otherwise, increment frame k and
go to step 2.

4. Resampling

a. Apply the resampling algorithm given in [Aru 02].

b. Set new weights $W_{k+1}^{(i)} = 1/N$, i = 1, ..., N.


Particle filtering provides a robust tracking framework, as it considers multiple
state hypotheses simultaneously. Since less likely states have a chance to temporarily
remain in the tracking process, particle filters can deal with short-lived occlusions.

Periodic analysis of the template (in MS tracking) or the observation model (in
particle filtering) for object appearance variations is required in order to avoid drift.
In particular, an incremental learning method, using the sequential Karhunen-Loève
(SKL) algorithm, which efficiently learns and updates a low-dimensional sub-space
representation of the target object, has been proposed within the context of par-
ticle filtering with an affine-motion model [Ros 08]. The authors observe that at any time
instance, it suffices to use an eigenbasis to account for appearance variation if the
object motion or illumination change is gradual.
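
The prediction, update, and resampling cycle of the particle filter can be sketched for a scalar state. The following is a toy illustration under stated assumptions rather than a tracker: the state is a single position, the dynamic model is constant velocity with additive Gaussian noise, the likelihood is Gaussian in the distance between the observation and the particle (in a visual tracker it would instead compare appearance, e.g., color histograms), and systematic resampling is used as one common choice of resampling scheme. All names and parameter values are hypothetical.

```python
import math
import random

def predict(particles, velocity, noise_std, rng):
    """Propagate each particle with a constant-velocity model plus Gaussian noise."""
    return [x + velocity + rng.gauss(0.0, noise_std) for x in particles]

def update(weights, particles, z, meas_std):
    """Multiply weights by a Gaussian likelihood of observation z and normalise."""
    w = [wi * math.exp(-0.5 * ((z - x) / meas_std) ** 2)
         for wi, x in zip(weights, particles)]
    total = sum(w)
    return [wi / total for wi in w]

def effective_number(weights):
    """Effective number of particles, 1 / sum of squared normalised weights."""
    return 1.0 / sum(w * w for w in weights)

def resample(particles, weights, rng):
    """Systematic resampling (assumed scheme); returns equally weighted particles."""
    n = len(particles)
    u = rng.random() / n
    out, cum, i = [], weights[0], 0
    for j in range(n):
        while u + j / n >= cum and i < n - 1:
            i += 1
            cum += weights[i]
        out.append(particles[i])
    return out, [1.0 / n] * n

def track(observations, n=200, n_thres=100, seed=0):
    """Return the weighted-mean state estimate after each observation."""
    rng = random.Random(seed)
    particles = [rng.uniform(-5.0, 5.0) for _ in range(n)]
    weights = [1.0 / n] * n
    estimates = []
    for z in observations:
        particles = predict(particles, 1.0, 0.5, rng)
        weights = update(weights, particles, z, 1.0)
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        if effective_number(weights) <= n_thres:
            particles, weights = resample(particles, weights, rng)
    return estimates
```

With equal weights the effective number equals N, and it drops toward 1 as the weight concentrates on a few particles, which is what triggers resampling. The sketch does not guard against the case where every likelihood underflows to zero, which a practical implementation would need to handle.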


5.4.5 Active-Contour Tracking 


Contour-tracking methods model propagation of active-contour models from 
frame-to-frame. We can classify contour-tracking methods as deterministic-search- 
based methods and probabilistic (conditional-density propagation) methods. 


Condensation (Conditional-Density Propagation) 


The condensation algorithm [Isa 98] is a sample-based Monte Carlo method, where
the conditional-density propagation concept behind particle filtering (Section
5.4.4) was first proposed for detection and tracking of contours (curves) of a moving
object in a cluttered environment. Hence,
each sample (particle) $\mathbf{x}_k^{(i)}$ is a curve of varying position and shape (instead of a
single pixel as in Section 5.4.4) with a thickness proportional to the weight $W_k^{(i)}$, and
the weighted mean of these curves is computed as the state estimate. Compared to
Kalman filtering, the condensation method is simpler and more general.


Motion Snake 


Motion snake is a deterministic method that extends the discrete snake formulation 
that was introduced in Section 5.1.5 to fit an active-contour model to a desired 
object by search-based minimization of different energy functions. This section 
introduces new energy functions to model motion of an active contour from frame- 
to-frame [Fu 00, Par 01]. 

We observe that there are two different motions at the opposite sides of a motion
boundary. This is illustrated in Figure 5.3 where the head of the dancer moves down
while her arms move up. To this effect, the contour is divided into a number of
segments such that there will be two candidate affine models, one for each side,
for each contour segment. We estimate motion vectors inside and outside the con-
tour on selected pixels along the angular bisectors at each node point as depicted in
Figure 5.4. Once we obtain two affine motion models for each segment computed in
two passes using estimated motion vectors inside and outside the contour segment,
respectively, each segment has two candidate predicted locations in the next frame as
depicted in Figure 5.3(b). In order to determine the correct predicted location for a
segment, we compute a prediction energy for both predicted locations, and select the
segment location with the smaller prediction energy with a bias favoring the location
predicted by the motion vectors on the inside of the contour.


Figure 5.3 Active-contour tracking: (a) contour in frame k; (b) two candidate predicted contour
locations in frame k + 1 based on inside and outside motion models; and (c) the contour in frame
k + 1 that minimizes the prediction energy [Fu 00]. (© IEEE 2000)


Figure 5.4 Estimation of motion vectors inside and outside the contour [Fu 00]. (© IEEE 2000)

The complete motion-snake algorithm consists of the following basic steps 
[Fu 00]: 


1. Specify an initial object contour in the first frame by marking a set of nodes 
along the desired contour to define an approximate polygonal shape, and then 
snap the approximate contour to fit the desired object tightly by minimizing 
intraframe energy terms.

2. Segment the snake into non-overlapping pieces by selecting a set of feature
nodes based on local curvature, color, and motion vectors using the method 
specified in [Fu 00]. 

3. Estimate motion vectors inside and outside the contour on selected pixels along 
the angular bisector at each node point as depicted in Figure 5.4. 

4. Obtain two predicted locations (based on estimated affine models using motion 
vectors inside and outside of the contour, respectively) for each contour seg- 
ment in the next frame. 

5. Select one of these two predicted locations for each contour segment based on 
minimization of prediction energy. 

6. Refine the predicted contour using interframe and intraframe energy terms, 
and go to step 2 to start processing the next frame. 


In other related work, robust tracking of deformable models over long video 
sequences has been addressed in [Ker 94]. 


5.4.6 2D-Mesh Tracking 


2D-mesh tracking is intended for tracking objects with mildly deformable motion 
without occlusions. A 2D mesh is a planar graph that tessellates (partitions) an image 
region into polygonal patches. The vertices of the patches are called node points. 
Patches are typically triangles or quadrangles, leading to triangular or quadrilateral 
meshes, respectively. Mesh-based motion models differ from block-based models in 
that patches overlap neither in the reference frame nor in the current frame (see 
Figure 5.5). Instead, triangular/polygonal patches in the current frame are deformed 
by the movements of the node points into respective patches in the reference frame, 
and texture within each patch in the reference frame is warped onto the current frame 
using a parametric model as a function of the node-point motion vectors [Tek 98]. 
The process of warping textures from one frame to another is called texture 
mapping. The affine model is used for texture mapping in triangular meshes. If 
proper constraints are imposed for parameter estimation, affine mapping guarantees 
continuity of motion and texture across triangle boundaries. This implies that the 
2D-motion field can be compactly represented by the motion of node points, from 
which a continuous, piece-wise affine motion field can be reconstructed. The mesh 
structure constrains movements of adjacent image patches. Hence, meshes are well 
suited to represent mildly deformable but spatially continuous motion fields [Tok 
96]. However, they do not allow motion discontinuities unless special constraints are 
applied to break the mesh structure at motion/occlusion boundaries [Alt 97]. 
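
The claim that node-point motion determines a continuous piece-wise affine field can be checked numerically. The sketch below is an illustration with hypothetical function names and coordinates: it solves for the six affine parameters of a triangle from its three node-point correspondences using Cramer's rule, and then maps a point on an edge shared by two triangles with each triangle's own model.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def affine_from_triangle(src, dst):
    """Solve x' = a1*x + a2*y + a3, y' = a4*x + a5*y + a6 from three node pairs."""
    m = [[x, y, 1.0] for x, y in src]
    d = det3(m)
    if d == 0:
        raise ValueError("degenerate (collinear) triangle")
    params = []
    for k in (0, 1):                      # k = 0: x' equation, k = 1: y' equation
        rhs = [p[k] for p in dst]
        for col in range(3):              # Cramer's rule, one unknown at a time
            mc = [row[:] for row in m]
            for r in range(3):
                mc[r][col] = rhs[r]
            params.append(det3(mc) / d)
    return params

def warp(params, x, y):
    """Apply the affine model to the point (x, y)."""
    a1, a2, a3, a4, a5, a6 = params
    return a1 * x + a2 * y + a3, a4 * x + a5 * y + a6

# Two triangles sharing the edge from (1, 0) to (0, 1); node motion is made up.
t1 = affine_from_triangle([(0, 0), (1, 0), (0, 1)], [(0, 0), (1.2, 0.1), (0, 1)])
t2 = affine_from_triangle([(1, 0), (1, 1), (0, 1)], [(1.2, 0.1), (1.5, 1.5), (0, 1)])
p1 = warp(t1, 0.5, 0.5)   # midpoint of the shared edge, mapped by triangle 1
p2 = warp(t2, 0.5, 0.5)   # the same point, mapped by triangle 2
```

Both models send the midpoint of the shared edge to the midpoint of the displaced edge, so `p1` and `p2` agree up to rounding even though the two triangles have different affine parameters. This holds because each affine model is fully determined by its node points and the two triangles share the two nodes on that edge.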




Figure 5.5 2D-mesh tracking: (a) initial uniform mesh and (b) motion-deformed mesh. 


5.5 Image and Video Matting 


Matting refers to accurate foreground object/subject estimation in images and video. 
It is a key technique to facilitate image and video editing or create novel composites 
for both professional film-production and consumer applications. An image s(x) can
be represented by a convex combination of a foreground f(x) and background b(x)


$$s(x) = \alpha\, f(x) + (1 - \alpha)\, b(x)$$


where the matte $\alpha$ can take any value in the range [0,1]. If we consider the special case
where alpha is binary, i.e., alpha values are only 0 or 1, the matting is equivalent to 
classic image/video segmentation, where each pixel belongs to either foreground or 
background. Matting is a more difficult problem than the classic segmentation, since 
we require extraction of semantically meaningful and pixel-accurate foreground 
objects together with the corresponding alpha values. 
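
The compositing model can be made concrete for a single RGB pixel. In the sketch below, `composite` applies the convex combination directly, and `alpha_from_known_colors` recovers alpha in the idealized case where both the foreground and background colors at the pixel are already known, by projecting s − b onto f − b. That second function is an assumption-laden illustration, not one of the matting algorithms discussed in this section, since in practice f and b are unknown at the pixels of interest.

```python
def composite(alpha, fg, bg):
    """Per-pixel convex combination s = alpha*f + (1 - alpha)*b for RGB tuples."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))

def alpha_from_known_colors(s, fg, bg):
    """Least-squares alpha when f and b are known and differ (idealised case)."""
    num = sum((si - bi) * (fi - bi) for si, fi, bi in zip(s, fg, bg))
    den = sum((fi - bi) ** 2 for fi, bi in zip(fg, bg))
    return min(1.0, max(0.0, num / den))

f, b = (200.0, 100.0, 0.0), (0.0, 100.0, 200.0)   # made-up colors
s = composite(0.25, f, b)
```

Setting alpha to exactly 0 or 1 returns the background or foreground color unchanged, which is the binary segmentation case mentioned above. The recovery of alpha is ill-posed when f and b coincide, in which case the denominator is zero.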

A semantic object may contain multiple colors, textures, motions, and shape 
deformations. Furthermore, the definition of semantic objects may depend on the 
context, which may not be captured by low level features. Hence, one should not 
expect to achieve semantically meaningful object segmentation using fully automatic 
methods based only on low-level features such as color, texture, shape, and motion. 
In general, extraction of semantically meaningful objects requires capture-specific 
information (e.g., chroma-keying) or user interaction. In matting, it is assumed that 
a tri-level image segmentation, called a trimap, which marks each pixel as definite 
foreground, definite background, or unknown is pre-specified by the user. Then, the 
matting problem is only solved for those pixels marked unknown to ensure extrac- 
tion of the desired foreground [Wan 07].


Chroma-Keying

Chroma-keying, also known as blue-screen matting, is a video-capture technology 
where each video object is recorded individually in a special studio against a key 
color, e.g., blue. The key color is selected such that it does not appear on the object 
to be captured. Then, the problem of extracting the object from each frame of video 
becomes one of color segmentation. Chroma-keyed video capture requires special 
attention to avoid shadows and other non-uniformity in the key color within a 
frame; otherwise, segmentation of key color may become a nontrivial problem. 
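
When the key color is uniform, the color segmentation reduces to thresholding the distance of each pixel from the key color. The following sketch assumes RGB tuples and a Euclidean distance in RGB space; the function name, the example frame, and the threshold are hypothetical, and production keyers typically work in other color spaces and produce soft mattes.

```python
def chroma_key_mask(frame, key, threshold):
    """Binary mask: 1 for object pixels, 0 for pixels close to the key color."""
    t2 = threshold * threshold
    return [[0 if sum((c - k) ** 2 for c, k in zip(px, key)) <= t2 else 1
             for px in row] for row in frame]

# A 2 x 2 frame shot against a blue key; one pixel belongs to the object.
frame = [[(0, 0, 255), (10, 5, 250)],
         [(200, 180, 160), (0, 0, 255)]]
mask = chroma_key_mask(frame, (0, 0, 255), 30)
```

Shadows and uneven lighting spread the key color over a wider region of color space, which forces a larger threshold and increases the risk of removing object pixels. This is why careful lighting matters during capture.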


Interactive Semi-Automatic Segmentation 


Since chroma-keying requires a special studio and/or equipment, a more practical 
alternative is interactive segmentation using user interfaces to aid a human operator. 
While background subtraction or motion segmentation may result in semantically 
meaningful objects in well-constrained settings, in an unconstrained environment, 
user interaction is indeed the only way to define a semantically meaningful object 
unambiguously because only the user can know what is semantically meaningful in 
a specific context. For example, if a person is running with a ball, whether the ball 
and the person are two separate objects or a single object may depend on the context. 


Image Matting 


Image-matting methods can be classified as i) color-sampling methods, including 
Bayesian matting and the knockout algorithm; ii) affinity-based methods, which 
model the matte gradient, including Poisson matting, random-walk matting, closed- 
form matting; and iii) a combination of both, including robust and geodesic mat- 
ting. The reader is referred to [Wan 07] for a discussion and comparison of these 
methods. 


Video Matting 


We assume that the contour of a semantic object of interest is roughly sketched by 
an electronic pen or just marked by some feature points along its contour in selected 
key frames by a human operator. The approximate initial contours are then snapped 
tightly to the desired object automatically using, for example, the snake method. 
Once the precise boundary of the object of interest is determined in one or more 
keyframes, its boundary in all remaining frames can be automatically computed by 
motion tracking, e.g., by active-contour tracking (see Section 5.4.5). Finally, we 
allow a band of “unknown” pixels of desired width along the tracked contours, and the
high-precision matting problem is solved for these unknown pixels frame-by-frame.


5.6 Performance Evaluation


Comparative assessment of segmentation and tracking results is often based on sub- 
jective judgement, which is qualitative and time consuming. Hence, many studies 
have been performed to associate an objective figure of merit with image/video- 
segmentation and visual tracking results. We can classify these studies as those that 
require ground truth (GT) data and those that do not.

In practical applications, GT data is rarely available. However, a number of 
databases with ground truth are available for benchmarking studies. The Berkeley 
segmentation dataset contains more than 10,000 hand-labeled segmentations for 
benchmarking image segmentation and boundary detection [Mar 01]. The open 
development environment for evaluation of video systems (ODViS) [Jay 02] allows 
a user to generate GT data for pre-recorded video. Measures that do not rely on GT 
data generally evaluate intra-region homogeneity, inter-region disparity, and spatial 
or spatio-temporal consistency of results. 

A set of performance measures that do not rely on GT data have been proposed 
in [Cor 03] that evaluate intra-object homogeneity based on shape regularity, spatial 
uniformity, temporal stability, and motion consistency. The inter-object disparity 
measures include local color and motion differences with neighboring regions. The 
usefulness of these measures has been demonstrated based on how the results pre- 
dicted by these measures correlate with judgments of human observers. In another 
study [Erd 04], spatial-color contrast along the estimated object boundary, motion 
difference along the object boundary, and color-histogram difference between suc- 
cessive object segmentation masks in the temporal direction were evaluated. These 
measures can be computed per object and per frame, so that it is possible to identify 
the objects and frames that are poorly segmented within a long video. 

Wu et al. [Wu 13] provide 50 fully annotated video sequences and 29 tracking 
algorithms for benchmarking of on-line visual-tracking algorithms. As an alternative 
to creating GT for videos, Black et al. [Bla 03] generate pseudo-synthetic video from 
a set of already compiled GT tracks to evaluate performance of tracking algorithms. 
Empirical standalone methods have been proposed to evaluate performance of track- 
ing algorithms without using ground-truth data [Wu 10, San 12]. The framework in 
[San 12] is divided into two stages: i) estimation of the tracker condition to identify 
intervals during which a target is lost, and ii) measurement of the quality of the esti- 
mated track when the tracker is successful. Successful tracking is identified by analyz- 
ing the uncertainty of the tracker, whereas track recovery from errors is determined 
based on the time-reversibility constraint. Finally, Xu and Corso [Xu 12] discuss
what makes a good super-voxel segmentation method and evaluate the performance
of some segmentation methods.


MATLAB Exercises


5.1 Background Subtraction 
Given a sequence with at least 20 frames, start processing from frame 11: 
a. Compute the running mean of the previous 10 frames. 
b. Compute the median of the previous 10 frames. 
c. Determine a suitable threshold value to detect moving objects by subtract- 
ing the mean- or median-filtered model frames from the current frame. 
d. Display both the model frames and the difference frames showing moving
objects. How do you compare mean vs. median filtering to compute model
frames? Comment on the results.


5.2 KLT Tracker

Given a sequence with N frames, 

a. Determine an initial object template in the first frame. Track the fixed tem- 
plate by affine warping toward successive frames. 

b. Select the same initial template. Track the template this time by updating 
the template every five frames. 

c. Comment on how the tracker with or without template updates behaves in 
the presence of clutter or occlusion. 


5.3 Mean-Shift Tracker 
Given the same video sequence and the initial object template as in Exercise 5.2,
implement the mean-shift (MS) template tracker. Compare the results with those of
Exercise 5.2 and comment on the performance of the two methods.


Internet Resources 


The Berkeley Segmentation Dataset and Benchmark 
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/ 


ViBe — Background Subtraction 
http://www2.ulg.ac.be/telecom/research/vibe/ 


KLT: An Implementation of the Kanade-Lucas-Tomasi Feature Tracker
http://www.ces.clemson.edu/~stb/klt/ 


The Condensation Algorithm 
http://www.robots.ox.ac.uk/~misard/condensation.html 


Code and Data for Incremental Learning for Robust Visual Tracking 
http://www.cs.toronto.edu/~dross/ivt/ 


Visual Tracker Benchmark 
https://sites.google.com/site/trackerbenchmark/benchmarks/v10 








CHAPTER 6 


Video Filtering 





The performance of single-frame (image) de-noising and restoration methods may 
be improved by multi-frame filtering, whereas video-format conversion and super- 
resolution reconstruction are inherently multi-frame filtering problems. 


This chapter extends the image-filtering methods covered in Chapter 3 by introduc- 
ing new methods that are specific to video, including multiple-picture (field or frame) 
filtering methods. Multi-frame filters can be classified as linear spatio-temporal fil- 
ters, motion-adaptive filters, and motion-compensated filters. The theory of linear 
spatio-temporal filtering is provided in Section 6.1. Temporal frequency content of 
a video is dependent on the spatial-frequency content of a key frame and its motion. 
Hence, video filters should be designed by taking the motion content of the video 
into account. Because errors in motion detection and motion estimation are unavoid- 
able, a fallback mode that does not use motion information (intra-mode filtering) 
should always be supported for robust processing without visual artifacts. Motion-adaptive
methods, which require motion detection, and motion-compensated filtering,
which requires true motion estimation, are often designed specifically for a given problem.
Multi-frame filters designed for video-format conversion, de-noising, restoration, 
and super-resolution are introduced in Sections 6.2 to 6.5, respectively. 
6.1 Theory of Spatio-Temporal Filtering


Temporal-frequency content of a video is dependent on the spatial-frequency con- 
tent of a key frame and the motion content of the video, which is discussed in Sec- 
tion 6.1.1. Hence, spatio-temporal filtering should not be viewed as arbitrary 3D 
filtering in space and time, but should be considered taking the motion content of 
video into account. To this effect, we discuss general principles of motion-adaptive 
filtering with motion detection in Section 6.1.2 and general principles of motion- 
compensated filtering with motion estimation in Section 6.1.3. 


6.1.1 Frequency Spectrum of Video 


We first define a motion trajectory and then derive the frequency spectrum of video
for the case of global translational motion. Each pixel follows a curve in the $(x_1, x_2, t)$
space, called a “motion trajectory,” which can be formally defined as a vector function
$c(t; x_1, x_2, t_0)$ that specifies the horizontal and vertical coordinates $(x_1', x_2')$ at time $t'$
of a reference pixel $(x_1, x_2)$ at time $t_0$, i.e., $(x_1', x_2') = c(t'; x_1, x_2, t_0)$
[Dub 92]. The motion trajectory is illustrated in Figure 6.1. Given the motion trajectory
$c(t; x_1, x_2, t_0)$, the velocity of a pixel $(x_1', x_2')$ along the trajectory at time $t'$ can
be defined by

$$ v(x_1', x_2', t') = \left. \frac{d\,c(t; x_1, x_2, t_0)}{dt} \right|_{t = t'} $$


The fundamental assumption in motion estimation/compensation is that the
intensity of a pixel remains unchanged along a motion trajectory, which places a
constraint on the local spatio-temporal spectrum of video. More specifically, local
temporal-frequency content of video depends on local spatial-frequency content
and the motion. We illustrate this concept for constant-velocity translation motion
below.

Figure 6.1 Motion trajectory passing through pixel $(x_1', x_2')$ at time $t'$.


Constant-Velocity Global Translation 


A simple model of image-plane motion is a global translation with constant velocity
$(v_1, v_2)$, where frame-to-frame intensity variations can be modeled as

$$ s_c(x_1, x_2, t) = s_c(x_1 - v_1 t,\, x_2 - v_2 t,\, 0) = s_0(x_1 - v_1 t,\, x_2 - v_2 t) \tag{6.1} $$

where the reference (key) frame is chosen as $t = 0$ and $s_0(x_1, x_2)$ denotes the
2D-intensity function of the reference frame.

In order to derive the spatio-temporal spectrum of video with constant-velocity 
global motion, we first define the Fourier transform of an arbitrary spatio-temporal 
function as (see Chapter 1) 


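$$ S_c(F_1, F_2, F_t) = \iiint s_c(x_1, x_2, t)\, e^{-j 2\pi (F_1 x_1 + F_2 x_2 + F_t t)}\, dx_1\, dx_2\, dt \tag{6.2} $$

where the inverse Fourier transform relationship is given by

$$ s_c(x_1, x_2, t) = \iiint S_c(F_1, F_2, F_t)\, e^{j 2\pi (F_1 x_1 + F_2 x_2 + F_t t)}\, dF_1\, dF_2\, dF_t $$

The support of $S_c(F_1, F_2, F_t)$ may occupy the entire $(F_1, F_2, F_t)$ space for arbitrary
intensity functions $s_c(x_1, x_2, t)$. Next, we substitute (6.1) into (6.2) to obtain

$$ S_c(F_1, F_2, F_t) = \iiint s_0(x_1 - v_1 t,\, x_2 - v_2 t)\, e^{-j 2\pi (F_1 x_1 + F_2 x_2 + F_t t)}\, dx_1\, dx_2\, dt $$

Making the change of variables $x_i' = x_i - v_i t$, for $i = 1, 2$, we get

$$ S_c(F_1, F_2, F_t) = \iint s_0(x_1', x_2')\, e^{-j 2\pi (F_1 x_1' + F_2 x_2')}\, dx_1'\, dx_2' \int e^{-j 2\pi (F_1 v_1 + F_2 v_2 + F_t) t}\, dt $$

which can be simplified as

$$ S_c(F_1, F_2, F_t) = S_0(F_1, F_2)\, \delta(F_1 v_1 + F_2 v_2 + F_t) \tag{6.3} $$

where $S_0(F_1, F_2)$ is the 2D Fourier transform of $s_0(x_1, x_2)$ and $\delta(\cdot)$ is the 1D Dirac delta
function. Defining $F = [F_1\ F_2\ F_t]^T$ and $v = [v_1\ v_2\ 1]^T$, the delta function in Eqn. (6.3) is
non-zero only when its argument $F^T v = 0$. The delta function thus confines the
support of $S_c(F_1, F_2, F_t)$ to a plane in $F$ given by

$$ F_1 v_1 + F_2 v_2 + F_t = 0 $$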


which passes through the origin, and is orthogonal to the vector v. The spectral sup- 
port of video with uniform velocity global motion is depicted in Figure 6.2(a). 

This result is intuitively clear. Since the reference frame is sufficient to determine 
all future frames by the nature of the motion model, we expect that the Fourier 
transform of the reference frame would be sufficient to represent the entire spatio- 
temporal-frequency spectrum of video. 

The extent of the support of the continuous video spectrum $S_c(F_1, F_2, F_t)$ on the
plane $F^T v = 0$ is determined by the support of $S_0(F_1, F_2)$. If we assume that $s_0(x_1, x_2)$
is bandlimited, i.e., $S_0(F_1, F_2) = 0$ for $|F_1| > B_1$ and $|F_2| > B_2$, then clearly $s_c(x_1, x_2, t)$
is also bandlimited in the temporal variable, i.e., $S_c(F_1, F_2, F_t) = 0$ for $|F_t| > B_t$, where

$$ B_t = B_1 v_1 + B_2 v_2 $$

For simplicity of illustration, the projection of the support of $S_c(F_1, F_2, F_t)$ into the
$(F_1, F_t)$ plane, defined by $F_1 v_1 + F_t = 0$, is depicted in Figure 6.2(b).
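
As a small numerical illustration of the plane relation $F_1 v_1 + F_2 v_2 + F_t = 0$ and of the temporal bandwidth $B_t = B_1 v_1 + B_2 v_2$, the sketch below evaluates both. The function names are hypothetical, frequencies are assumed to be in cycles per pixel and velocities in pixels per frame, and taking absolute values of the velocities is an added assumption so that negative velocities do not shrink the bandwidth.

```python
def temporal_frequency(F1, F2, v1, v2):
    """Temporal frequency on the support plane F1*v1 + F2*v2 + Ft = 0."""
    return -(F1 * v1 + F2 * v2)

def temporal_bandwidth(B1, B2, v1, v2):
    """B_t = B1*|v1| + B2*|v2|; abs() is an added assumption for negative velocities."""
    return B1 * abs(v1) + B2 * abs(v2)

# A horizontal pattern at 0.25 cycles/pixel moving at 2 pixels/frame
# appears at a temporal frequency of -0.5 cycles/frame.
print(temporal_frequency(0.25, 0.0, 2, 0))
print(temporal_bandwidth(0.5, 0.5, 2, 1))
```

Faster motion therefore spreads the same spatial detail over a wider temporal band, which is why the frame rate needed to avoid temporal aliasing depends on the motion content.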

Even if the global motion assumption is not valid, this motion model is commonly
used for local video blocks in many applications, including international
video compression standards. Hence, the spectrum model derived here is often
applicable to short-term (windowed) Fourier transform of local video blocks. If the
motion within a block deviates from uniform translation, then the spectrum is still
concentrated around a plane determined by the best translation vector.

Figure 6.2 Spectral support of video with uniform velocity global motion: (a) in the $(F_1, F_2, F_t)$
space and (b) projection of the spectral support onto the $(F_1, F_t)$ plane.


6.1.2 Motion-Adaptive Filtering 


Motion-adaptive filtering refers to employing different filtering strategies in 
the presence and absence of motion without the need for motion estimation. 
Motion-adaptive filters may employ explicit or implicit motion adaptation. 
Explicit schemes employ motion detection, and the form of the filter depends on 
the value of the motion-detection function. Several motion-detection methods, 
ranging from simple frame difference to more sophisticated temporal-integration 
schemes, for progressive video frames have been discussed in Section 5.2.2. For 
interlaced video input, we employ field differences, between the same parity fields, 
for motion detection. In implicit motion-adaptive filtering, there is no explicit
motion detection, and motion adaptivity is inherent in the filter structure, as is
the case in median filtering. We elaborate on some commonly used motion-adaptive
de-interlacing and frame-rate conversion algorithms in Sections 6.2.2 and
6.2.3, respectively.


6.1.3 Motion-Compensated Filtering 


Motion-compensated (MC) filtering refers to filtering along motion trajectories at 
each pixel of each frame. Although MC filtering can be defined for arbitrary motion 
trajectories, it is the optimum linear shift-invariant filter for video with constant- 
velocity global motion. This is because the support of the spatio-temporal frequency 
response of a properly MC linear shift-invariant filter matches the support of the 
spatio-temporal frequency spectrum of video with constant-velocity global motion. 


Arbitrary-Motion Trajectories 
We define MC filtering along an arbitrary-motion trajectory by [Dub 92]

$$ y(x_1, x_2, t) = \mathcal{F}\{\, s_c(c(\tau; x_1, x_2, t), \tau) \,\} \tag{6.4} $$

where $\mathcal{F}$ is a 1D linear or non-linear filter along the motion trajectory passing
through $(x_1, x_2, t)$. In the most general case, a different motion trajectory $c(\tau; x_1, x_2, t)$
may be defined for each pixel $(x_1, x_2)$ of the frame at $t$, and $\tau$ ranges over all frames within
the temporal support of the filter. The resulting MC filter is shift-invariant only if all
motion trajectories at each pixel of each frame are parallel to each other. This is the
case when we have constant-velocity global motion. Motion-compensated filtering 
in the case of global but accelerated (in time) motion, or in the case of any space- 
varying motion, will result in a shift-varying filter. Obviously, a spatio-temporal 
frequency domain analysis of MC filtering can only be performed in the linear shift- 
invariant case, which is discussed next. 


Linear Shift-Invariant Filtering in the Case of Constant-Velocity Global Motion 


Given the velocity (motion vector) estimate $(v_1, v_2)$, the impulse response of an MC
spatio-temporal filter can be expressed as

$$ h(x_1, x_2, t) = h_1(t)\, \delta(x_1 - v_1 t,\, x_2 - v_2 t) \tag{6.5} $$

where $h_1(t)$ is the impulse response of the 1D filter applied along the motion
trajectory.

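Hence, the linear shift-invariant filtering operation along a constant-velocity motion
trajectory can be expressed as

$$ y(x_1, x_2, t) = \iiint h_1(\tau)\, \delta(z_1 - v_1 \tau,\, z_2 - v_2 \tau)\, s_c(x_1 - z_1,\, x_2 - z_2,\, t - \tau)\, dz_1\, dz_2\, d\tau $$
$$ \phantom{y(x_1, x_2, t)} = \int h_1(\tau)\, s_c(x_1 - v_1 \tau,\, x_2 - v_2 \tau,\, t - \tau)\, d\tau $$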

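The frequency response of the MC filter can be found by taking the 3D Fourier
transform of the impulse response (6.5)

$$ H(F_1, F_2, F_t) = \iiint h_1(t)\, \delta(x_1 - v_1 t,\, x_2 - v_2 t)\, e^{-j 2\pi (F_1 x_1 + F_2 x_2 + F_t t)}\, dx_1\, dx_2\, dt $$
$$ \phantom{H(F_1, F_2, F_t)} = \int h_1(t)\, e^{-j 2\pi (F_1 v_1 + F_2 v_2 + F_t) t}\, dt $$
$$ \phantom{H(F_1, F_2, F_t)} = H_1(F_1 v_1 + F_2 v_2 + F_t) \tag{6.6} $$

Observe that the frequency response of the filter $H(F_1, F_2, F_t)$ is constant on
planes $F_1 v_1 + F_2 v_2 + F_t = F$. If the spectrum of the input is confined to the plane
$F_1 v_1 + F_2 v_2 + F_t = 0$, then the effect of filtering is simply multiplication by $H_1(0)$.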

In general, the spectrum of input video extends out of the plane due to deviation
of motion from constant-velocity translation and the presence of occlusion and noise.
Then, a low-pass filter $H_1(F)$ is employed to attenuate frequencies out of the plane,
which results in a spatio-temporal low-pass filter $H(F_1, F_2, F_t)$ with a finite passband
around the plane. The passband of the 3D spatio-temporal filter projected onto the
$(F_1, F_t)$ plane is a parallelogram. The projected ideal filter $H(F_1, F_t)$ is given by

$$ H(F_1, F_t) = \begin{cases} 1 & -B_1 < F_1 < B_1 \ \text{and} \ -v_1 F_1 - B_t \le F_t \le -v_1 F_1 + B_t \\ 0 & \text{otherwise} \end{cases} \tag{6.7} $$

whose support is depicted in Figure 6.3. Proper motion compensation is achieved
when $v_1$ matches the velocity of the input video. The case $v_1 = 0$ corresponds to pure
temporal filtering with no motion compensation.

Figure 6.3 Frequency response of the MC filter projected onto the $(F_1, F_t)$ plane.
The passband has width $2B_t$ and slope $-v_1$. The solid line is the spectrum of input
with matching velocity. The dotted lines indicate the tolerance range $\Delta$.

Because the ideal filter is unrealizable, it needs to be approximated, usually by a
finite impulse response (FIR) filter, which poses a trade-off between filter length and
passband width. In practice, the number of frame stores is limited, which necessitates 
the use of short temporal filters. Typically zero-order hold or pixel averaging along 
the motion trajectory is used. This causes a wider than desired transition band in the 
temporal-frequency dimension, which limits the aliasing or noise-rejection capabil- 
ity of the filter in resampling and de-noising applications, respectively. A number of 
other filter design issues for MC filtering are addressed in [Gir 93]. 


Sensitivity to Errors in Motion Estimation 


Because the MC low-pass filter has a finite passband width $B_t$, as shown in Figure 6.3,
an MC filter based on an estimated velocity $v_1$ is capable of successfully filtering
video with actual velocity within a range $[v_1 - \Delta,\, v_1 + \Delta]$, where the tolerance $\Delta$
depends on $B_t$ [Gir 85]. A filter with a wider passband yields a larger tolerance to
motion-estimation errors, but may also pass spectral replications or noise within
the passband range. A smaller passband means better suppression of spectral replications
or noise, but smaller tolerance to motion-estimation errors.
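
The trade-off can be made concrete with a small sketch of the projected ideal passband, i.e., the parallelogram $|F_1| < B_1$, $|F_t + v_1 F_1| \le B_t$. The function names are hypothetical. The expression $B_t / B_1$ for the tolerance is not taken from [Gir 85]; it follows from the parallelogram under the added assumption that the input occupies the full spatial band $|F_1| < B_1$, so that the line $F_t = -v F_1$ must stay inside the passband up to the band edge.

```python
def in_passband(F1, Ft, v1, B1, Bt):
    """True if (F1, Ft) lies inside the projected ideal MC passband."""
    return -B1 < F1 < B1 and -v1 * F1 - Bt <= Ft <= -v1 * F1 + Bt

def velocity_tolerance(B1, Bt):
    """Largest |v - v1| for which the line Ft = -v*F1 stays in the passband (assumed full band)."""
    return Bt / B1

v1, B1, Bt = 2, 0.5, 0.1
print(velocity_tolerance(B1, Bt))
print(in_passband(0.49, -2.15 * 0.49, v1, B1, Bt))  # velocity error 0.15: passed
print(in_passband(0.49, -2.30 * 0.49, v1, B1, Bt))  # velocity error 0.30: rejected
```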


Reliable Motion Estimation 


MC filtering has become practical after advances in hardware and software motion 
estimation solutions in the late 1990s. Various motion-estimation methods, includ- 
ing forward and backward block-matching, phase-correlation, and optical-flow 
methods, were discussed in Chapter 4. In MC filtering, estimation of true motion 
is preferred over finding the local minimum of a criterion function. Hence, several 
motion estimation schemes have been optimized for particular MC filtering applica- 
tions [Bie 86, Caf 90, Tub 93, Haa 93, Yam 94, Hei 11]. 

In particular, MC de-interlacing and frame-rate conversion require estimation
of motion trajectories that pass through missing pixel locations. Symmetric block-matching
is an extension of block-matching that is developed for this purpose. It is
illustrated in Figure 6.4, where blocks in two existing neighboring frames/fields $k-1$
and $k+1$ are moved symmetrically, so that the line connecting the centers of these
two blocks always passes through the missing pixel $(x_1, x_2)$ in frame $k$ to define the
motion trajectory for the missing pixels [Tho 89].
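
The idea can be sketched in one spatial dimension as follows. This is an illustrative outline only: `symmetric_block_match`, the block size, the search range, the SAD criterion, and border clamping are assumptions for the example, and the displacement `d` is measured per frame interval.

```python
def symmetric_block_match(prev_frame, next_frame, x, half_block=1, search=2):
    """Find the per-frame displacement d whose trajectory passes through missing pixel x.

    The block in frame k-1 is centered at x - d and the block in frame k+1 at x + d,
    so the line joining the two block centers always passes through x in frame k.
    """
    width = len(prev_frame)
    best_d, best_sad = 0, float("inf")
    for d in range(-search, search + 1):
        sad = 0
        for j in range(-half_block, half_block + 1):
            xp = min(max(x - d + j, 0), width - 1)
            xn = min(max(x + d + j, 0), width - 1)
            sad += abs(prev_frame[xp] - next_frame[xn])
        if sad < best_sad:
            best_d, best_sad = d, sad
    return best_d, best_sad

prev_frame = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7]
next_frame = [3, 3] + prev_frame[:-2]  # content moved right by 2 pixels over two frames
print(symmetric_block_match(prev_frame, next_frame, x=6))
```

In flat regions many displacements give the same SAD, which is one reason the estimated vectors usually need the post-processing discussed in this section.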

The accuracy and consistency of the motion estimates are probably the most important
factors in the effectiveness of MC filtering. Thus, some kind of post-processing is
usually applied to the estimated motion vectors to improve their accuracy.






Figure 6.4 Symmetric block-matching for MC up-sampling applications. (Figure label:
current frame/field k with missing samples.)
Post-Processing of Motion Estimates and Occlusion Handling


The accuracy of the motion estimates is crucial for MC filtering; hence, post- 
processing for removing outlier motion estimates and occlusion handling is neces- 
sary in order to decide whether to use MC filtering or go into a fallback mode of 
motion-adaptive or intra-filtering at all pixels. We present two alternative approaches 
to test the accuracy of the motion estimates: an occlusion-detection method and a 
displaced frame-difference (DFD) method. Note that these methods are applied to 
the two frames used in symmetrical block-matching. 

Occlusion detection is based on the assumption that a corona of motion vectors
is located around moving objects. This assumption, coupled with the observation
that the correct motion vectors from frame k to k+1 map into the “changed region”
CD(k, k+1), leads to the following procedure: If an estimated motion vector from
frame k to k+1 maps to the outside of the changed region, it indicates a pixel in
frame k that will be covered in frame k+1. Likewise, if a motion vector from frame
k+1 to k maps to the outside of the changed region, it indicates a pixel in frame k+1
that is uncovered. Such motion vectors are unreliable and should be discarded.

An alternative method for detecting unreliable motion vectors is to test the DFD
between the frames k and k+1, where all vectors yielding a DFD above a pre-specified
threshold are discarded. An unreliable motion vector can be replaced by one of a set of
candidate motion vectors if any of them yields a DFD that is less than the threshold.
The candidate vectors can be determined based on the analysis of the histogram of 
the reliable motion vectors [Lag 92]. If no reliable replacement motion vector can be 
found at a pixel, it is marked as a motion-estimation failure, where a fallback-mode 
motion-adaptive or an intra-frame filter is employed. 
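
The DFD test and the replacement step can be sketched as below for 1D frames. The names `dfd` and `validate_vectors` are hypothetical, a single-pixel DFD is used instead of a block-based one for brevity, and `None` is used here to mark a motion-estimation failure that would trigger the fallback mode.

```python
def dfd(frame_k, frame_k1, x, d):
    """Displaced frame difference at pixel x for displacement d (borders clamped)."""
    xs = min(max(x + d, 0), len(frame_k1) - 1)
    return abs(frame_k1[xs] - frame_k[x])

def validate_vectors(frame_k, frame_k1, vectors, candidates, threshold):
    """Keep reliable vectors, try candidate replacements, and mark failures with None."""
    result = []
    for x, d in enumerate(vectors):
        if dfd(frame_k, frame_k1, x, d) <= threshold:
            result.append(d)
            continue
        replacement = next(
            (c for c in candidates if dfd(frame_k, frame_k1, x, c) < threshold), None)
        result.append(replacement)
    return result

frame_k = [10, 20, 30, 40, 50, 60]
frame_k1 = [10, 10, 20, 30, 40, 50]  # content moved right by one pixel
print(validate_vectors(frame_k, frame_k1, [1, 1, 0, 1, -2, 1], candidates=[1], threshold=5))
```

In this toy case the two outlier vectors are replaced by the candidate, while the last pixel, whose content leaves the frame, has no reliable vector and is marked as a failure.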


6.2 Video-Format Conversion 


A video format consists of a spatial resolution (picture size) and a frame rate (fre- 
quency) as discussed in Chapter 2. In the era of digital multimedia, various progres- 
sive and interlaced video formats are used to capture, store, transmit, and display 
digital video, including those for SD/HD TV broadcast, digital cinema, web media, 
phones, and camcorders. Format conversion is required to ensure interoperability 
of various applications by decoupling the spatio-temporal resolution requirements 
of the source from that of the display. The task of converting digital video from 
one format to another is referred to as video-format (standards) conversion, which 
includes interlacing/de-interlacing and frame/field rate down-/up-conversion. Both
frame/field rate conversion and de-interlacing are based on the same principle of
sampling-structure conversion; the only difference between them is the structure of
the input and output lattices. Video-format conversion is necessary in such applications
as displaying 50i/60i broadcast video in 100/120 Hz interlaced or progressive
displays to reduce flicker, exchanging broadcast video content between 50 Hz and 60
Hz countries, converting digital cinema at 24 fps to 50/60 Hz TV broadcast formats,
and in post-production workflow when combining content shot at different frame
rates or inserting overlays and special effects.

Figure 6.5 Video-format conversion problems: (a) frame-rate up-conversion; (b) field rate (scan)
up-conversion; and (c) de-interlacing. In all figures, filled circles show existing image
lines and open circles indicate missing lines to be interpolated.

Most of the common video-format conversion problems are illustrated in 
Figure 6.5. Frame and field rate up-conversion increase the temporal sampling rate 
in progressive and interlaced video, respectively. De-interlacing refers to up-conver- 
sion from interlaced to progressive video. In addition to enabling reuse of video cap- 
tured at different frame rates, frame-rate up-conversion generally yields higher visual 
quality through better motion rendition and less flicker, and de-interlacing provides 
improved spatial resolution in the vertical direction. 

All video-format conversion problems deal with sampling structure conversion, 
which was first introduced in Section 1.5, where it was discussed for arbitrary input 
and output sampling structures and arbitrary signals within the framework of lin- 
ear filtering. Here, we extend that framework by incorporating motion models that 
characterize temporal variations in video. We first treat down-conversion in Section 
6.2.1, where we show that unlike the case of still images, in some cases sub-sampling 
without anti-alias filtering may be desirable in video processing, since aliasing can be 
used to recover some frequencies beyond the Nyquist frequency for super-resolution 
reconstruction. It is clear from our discussion in Section 6.1 that video-format con- 
version requires designing truly spatio-temporal filters, taking into consideration the 
spatio-temporal bandwidth of video before and after the conversion. However, in 
many practical applications, spatial and temporal filtering are considered separately
for ease of design and implementation. When performing MC frame-rate conver- 
sion starting with an interlaced source, de-interlacing is often employed first for 
correct motion rendering, even if the desired output format is also interlaced. For 
MC filtering, often a spatio-temporal filter in the direction of motion is used. We 
introduce some commonly used filters for de-interlacing and frame-rate conversion 
in Section 6.2.2 and Section 6.2.3, respectively. 


6.2.1 Down-Conversion 


Recall from Section 1.5 that down-conversion refers to anti-alias filtering (optional)
followed by down-sampling (sub-sampling). Suppose we down-sample from an M-dimensional (MD)
input lattice $\Lambda_1$ with the sampling matrix $V_1$ to an MD output lattice $\Lambda_2$ with the
sampling matrix $V_2$. We define an $M \times M$ sub-sampling matrix $S = V_1^{-1} V_2$ such that
$V_2 = V_1 S$. Since $V_1$ is always invertible, a sub-sampling matrix can always be defined.
Then, the sites of the output lattice $\Lambda_2$ can be expressed as

$$ y = V_1 S\, n, \quad n \in Z^M \tag{6.8} $$

For example, spatio-temporal sub-sampling according to the matrix

$$ S = \begin{bmatrix} 2 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} $$

corresponds to 2:1 interlacing, i.e., discarding even and odd columns in alternating
frames, respectively. Down-conversion with or without anti-alias filtering is
discussed next.
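
The relation $y = V_1 S n$ can be checked numerically with a short sketch. The helper names are hypothetical, and the example assumes an identity input sampling matrix $V_1$ and the sub-sampling matrix $S = [[2, 0, 1], [0, 1, 0], [0, 0, 1]]$, which is the form consistent with discarding even and odd columns in alternating frames.

```python
from itertools import product

def matvec(m, v):
    """Matrix-vector product for small integer matrices stored as nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def output_sites(V1, S, index_ranges):
    """Sites y = V1 * S * n of the output lattice for integer vectors n in the given ranges."""
    V2 = matmul(V1, S)
    return [tuple(matvec(V2, n)) for n in product(*index_ranges)]

V1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # assumed unit-spaced input lattice
S = [[2, 0, 1], [0, 1, 0], [0, 0, 1]]    # assumed sub-sampling matrix
print(output_sites(V1, S, [range(2), range(1), range(2)]))
```

The retained horizontal coordinates are even in even frames and odd in odd frames, which is the alternating pattern described above.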


Down-Conversion with Anti-Alias Filtering 


Should the Fourier spectrum of input video extend outside the unit cell of the
reciprocal lattice $V_2^*$, an anti-alias filter is needed before down-sampling in order
to avoid aliasing. The anti-alias filter attenuates all frequencies within the unit cell
of the reciprocal lattice $V_1^*$ that fall outside of the unit cell of the reciprocal lattice
$V_2^* = (V_1 S)^*$, so that the spectral replications centered about the sites of the reciprocal
lattice $V_2^*$ do not overlap with each other. However, because of this anti-alias filtering, the high
frequencies in the input video are permanently lost, and a viewer whose eyes track
a moving object can perceive spatial blurring. Furthermore, if the video needs to be
up-converted for display purposes, the original spectral content of the video cannot
be recovered even by using ideal MC-reconstruction filters. This is illustrated in the
“Comparison of Down-Conversion with or without Anti-aliasing” example below.
Clearly, in this case, successive frames contain no new frequency information, and
a simple linear shift-invariant low-pass filter is as good as a more sophisticated
MC-reconstruction filter.


Example: Conversion from ITU-R 601 4:2:2 to ITU-R 601 4:2:0 
Format 


Conversion from 4:2:2 interlace studio format to 4:2:0 interlace for com- 
pression and broadcast requires decimation of both chroma channels verti- 
cally by a factor of 2. This is complicated by the interlaced nature of the 
source, since simple vertical decimation by 2 of each field would not preserve 
the spatial offset between the lines of two fields. It can be treated by filtering 
one of the chroma fields (in the vertical direction) using a filter of odd length 
before eliminating every other line and processing the alternate chroma field 
by using an even length filter to offset the locations of lines by 1/2 sample. 
The odd- and even-length filter impulse responses are [-29, 0, 88, 138, 88, 0, 
-29]/256, and [1, 7, 7, 1]/16, respectively. Note that the luma channel is not 
modified in this conversion. 
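
The two chroma filters can be applied as in the sketch below, which filters one chroma column vertically and keeps every other output. Only the tap values come from the example above; the function name, the border handling (edge replication), and the choice of which field receives which filter are assumptions. With four taps the filter is centered half a sample below the retained position, which provides the 1/2-sample offset.

```python
ODD_TAPS = [-29, 0, 88, 138, 88, 0, -29]   # normalized by 256
EVEN_TAPS = [1, 7, 7, 1]                    # normalized by 16

def filter_and_decimate(column, taps, norm):
    """Vertically filter one chroma column and keep every other sample."""
    n = len(column)
    left = (len(taps) - 1) // 2   # odd length: centered on y; even length: centered on y + 1/2
    out = []
    for y in range(0, n, 2):
        acc = 0
        for i, h in enumerate(taps):
            idx = min(max(y - left + i, 0), n - 1)  # replicate border samples
            acc += h * column[idx]
        out.append(acc / norm)
    return out

ramp = [0, 10, 20, 30, 40, 50, 60, 70]
print(filter_and_decimate(ramp, ODD_TAPS, 256))   # interior samples land on the original lines
print(filter_and_decimate(ramp, EVEN_TAPS, 16))   # interior samples land halfway between lines
```

Both tap sets sum to their normalization constants (256 and 16), so flat chroma regions pass through unchanged.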


Down-Conversion without Anti-Alias Filtering 


Down-conversion without anti-alias filtering refers to simply discarding a subset 
of the input samples specified by a sub-sampling matrix S. Unlike the case of still 
images, this process can preserve the high-frequency content of the original video 
except for the case of critical velocities, provided that we have global, constant-velocity
motion. Even if the video technically contains aliasing, a viewer whose eyes track
the moving object does not see aliasing artifacts and perceives the moving object with
the original spatial detail. Furthermore, MC linear shift-invariant filtering can be
employed for subsequent up-conversion of video without any loss of spatial resolution.
This is demonstrated by the following example.


Example: Comparison of Down-Conversion with or without 
Anti-aliasing 

For the sake of simplicity, let’s consider a single spatial (horizontal) coordinate,
x, and a time coordinate, t, and assume we have horizontal motion with
constant velocity v. Suppose that the output lattice $V_2$ is obtained by
sub-sampling a progressive input lattice $V_1$ according to the sub-sampling matrix

$$ S = \begin{bmatrix} 2 & 1 \\ 0 & 1 \end{bmatrix} $$

which corresponds to discarding even and odd spatial samples in alternating
time samples. The spatio-temporal (x, t) input and output lattices in this case
are depicted in Figure 6.6.

The spectrum of the input video, assuming global, constant-velocity
motion, is depicted in Figure 6.7(a), where the solid dots indicate the sites of
the reciprocal lattice $V_1^*$ and the slopes of the lines are determined by the velocity v.
The dotted lines denote the support of an ideal anti-alias filter. The spectra of
the down-converted signal with and without anti-alias filtering are shown in
Figures 6.7(b) and (c), respectively. The loss of high-frequency information
when an anti-alias filter has been employed can be clearly observed. Further
observe that the spectrum of the down-converted signal in Figure 6.7(c)
retains the high-frequency content of the original signal, provided that we do
not have a critical velocity, discussed below.

Figure 6.6 Two-dimensional (a) input and (b) output lattices.

Figure 6.7 Fourier spectrum of (a) input signal; (b) sub-sampled signal with anti-alias filtering; and
(c) sub-sampled signal without anti-alias filtering.

Figure 6.8 Critical velocities in the case of de-interlacing: (a) the spatio-temporal
domain and (b) the Fourier domain.


If the velocity v is such that the orientation of the lines (spectra) aligns with sites
of the reciprocal lattice, it is called a critical velocity. Then, the replications overlap
with each other, resulting in aliasing (loss of information) as shown in Figure 6.8(a).
Given the sub-sampling matrix, we can easily determine the critical velocities. For
the sub-sampling matrix in the “Comparison of Down-Conversion with or without
Anti-aliasing” example, the velocities $v = 2i + 1$, $i \in Z$, are critical velocities. For
these velocities, the motion trajectory passes through either all existing or all non-existing
pixel sites at every frame in the spatio-temporal domain, as shown in Figure 6.8(b).
Note that with proper anti-alias filtering, the replicas in the frequency domain can
never overlap; hence, there are no critical velocities.
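
The trajectory-based characterization of critical velocities can be verified with a few lines of code. The sketch assumes the output lattice of the example keeps the sites $x = 2n + k$ in frame $k$ (even samples in even frames, odd samples in odd frames) and considers integer velocities only; the function names are hypothetical.

```python
def site_exists(x, t):
    """True if sample x is retained in frame t on the assumed lattice x = 2n + t."""
    return (x - t) % 2 == 0

def is_critical(v, frames=8):
    """A velocity is critical if every trajectory x0 + v*t hits only existing
    sites or only missing sites in all frames."""
    for x0 in (0, 1):
        hits = {site_exists(x0 + v * t, t) for t in range(frames)}
        if len(hits) == 2:   # the trajectory visits both existing and missing sites
            return False
    return True

print([v for v in range(-3, 4) if is_critical(v)])
```

For non-critical velocities each trajectory alternates between existing and missing sites, so the samples missing in one frame can be recovered from neighboring frames along the trajectory.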

In conclusion, down-conversion to sub-Nyquist rates followed by up-conversion 
without loss of resolution is possible only if no anti-alias filtering has been used in 
the down-conversion, and we do not have a critical velocity. If the estimated veloc- 
ity vector at a particular frame or a block is close to this critical velocity, we need to 
choose one of the following options before sub-sampling: 


1. Anti-alias filtering: The spatial resolution is sacrificed at a particular frame or
block of pixels when the estimated motion is close to the critical velocity.

2. Adaptive sub-sampling: If we are allowed to change the sub-sampling lattice, 
given an estimate of the motion vector, we may change the sub-sampling matrix 
to avoid critical velocities assuming global, constant-velocity motion. 
Figure 6.9 Intra-field filtering.


6.2.2 De-Interlacing 


The process of interlace-to-progressive scan conversion is called de-interlacing. De- 
interlacing techniques complete each field to a frame by estimating (interpolating) 
the missing lines, which are depicted by open circles in Figure 6.5(c). De-interlacing 
methods can be classified as intra-field vs. inter-field methods, where the latter can 
be further classified as motion-adaptive and MC methods. 


Intra-Field De-Interlacing 


Intra-field de-interlacing refers to interpolating missing lines from the available lines 
within a single field. A field of an interlaced video is depicted in Figure 6.9, where 
each circle denotes the cross-section of a complete line of video. The filled circles 
denote lines that are available, and the open circles show lines to be interpolated. 
Intra-field de-interlacing methods include linear (vertical) or edge-adaptive interpo- 
lation filtering. Intra-field filters do not generate motion artifacts; however, they may 
cause aliasing or loss of vertical resolution mainly around horizontal edges. 


Linear Interpolation 


These methods are called “bob” filtering in the computer industry. Let $s(x_1, x_2, t_i)$,
$i = e, o$, denote the even and odd fields, respectively, at time $t_i$, such that $s(x_1, x_2, t_e)$ is
zero for odd values of $x_2$, and $s(x_1, x_2, t_o)$ is zero for even values of $x_2$.


Line Repetition Assuming that the index of the first line of each field is zero, the
line-repetition algorithm can be described by

$$ s(x_1, x_2, t_e) = s(x_1, x_2 - 1, t_e) \quad \text{for } x_2 \text{ odd} \tag{6.9a} $$

and

$$ s(x_1, x_2, t_o) = s(x_1, x_2 + 1, t_o) \quad \text{for } x_2 \text{ even} \tag{6.9b} $$

Line Averaging The line-averaging algorithm is given by

$$ s(x_1, x_2, t_i) = \frac{1}{2} \left[ s(x_1, x_2 - 1, t_i) + s(x_1, x_2 + 1, t_i) \right] \quad \text{for } i = e, o \tag{6.10} $$

The line-repetition algorithm may result in jagged edges, while the line-averaging
algorithm may cause undesired blurring.
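
Both intra-field interpolators can be sketched in a few lines. In this illustration a field is stored at full frame height with `None` in place of the missing lines; that representation, the function names, the assumption of an even number of lines, and the border rule in `line_averaging` (reuse the single available neighbor) are choices made for the example.

```python
def line_repetition(rows):
    """Fill missing lines (None) by copying a neighboring line of the same field."""
    out = list(rows)
    for y, line in enumerate(rows):
        if line is None:
            # Even field (odd lines missing): copy the line above.
            # Odd field (even lines missing): copy the line below.
            src = y - 1 if y % 2 == 1 else y + 1
            out[y] = list(rows[src])
    return out

def line_averaging(rows):
    """Fill missing lines (None) with the average of the lines above and below."""
    out = list(rows)
    last = len(rows) - 1
    for y, line in enumerate(rows):
        if line is None:
            above = rows[y - 1] if y > 0 else rows[y + 1]
            below = rows[y + 1] if y < last else rows[y - 1]
            out[y] = [(a + b) / 2 for a, b in zip(above, below)]
    return out

even_field = [[10, 10], None, [30, 30], None]
print(line_repetition(even_field))
print(line_averaging(even_field))
```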


Edge-Adaptive Intra-Field Interpolation 


Edge-adaptive interpolation methods have been proposed to avoid blurring near
edges [Lim 90, Lee 94]. In the edge-adaptive approach, each line of video in a field/frame
$t_i$ is locally modeled as a horizontally displaced version of the previous line in
the same field/frame as

$$ s(x_1 - d/2,\, x_2 - 1,\, t_i) = s(x_1 + d/2,\, x_2 + 1,\, t_i) $$

where $d$ denotes the local horizontal displacement between two consecutive even or
odd lines. This model suggests a 1D displacement-compensation problem where $x_2$
takes the place of the time variable in the motion compensation problem. The displacement
$d$ at each pixel can be estimated by using symmetric line-segment matching
about $(x_1, x_2)$ in order to minimize the sum of absolute differences (SAD) given by








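$$ \mathrm{SAD}(d) = \sum_{j} \left| s(x_1 - d/2 + j,\, x_2 - 1,\, t_i) - s(x_1 + d/2 + j,\, x_2 + 1,\, t_i) \right| \tag{6.11} $$

or through the pixel-flow relationship [Lim 90]

$$ \frac{d}{2}\, \frac{\partial s}{\partial x_1} = -\frac{\partial s}{\partial x_2} \tag{6.12} $$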


which is similar to the optical-flow equation for the case of vertical edge-displacement
estimation. Then an edge-adaptive contour-interpolation filter can be defined as

$$ s(x_1, x_2, t_i) = \frac{1}{2} \left[ s(x_1 - d/2,\, x_2 - 1,\, t_i) + s(x_1 + d/2,\, x_2 + 1,\, t_i) \right] \tag{6.13} $$

for $i = e, o$, where e and o denote even and odd fields, respectively. This filter seeks
those two pixels in the two neighboring lines that most likely belong to the same
image structure, i.e., on the same side of the edge, and averages them. The fact that
the filter is capable of preserving a 45-degree edge, unlike the linear averaging filter,
is demonstrated in Figure 6.10. The crucial step here is the accurate estimation of
local edge-displacement (orientation) values.

Figure 6.10 Demonstration of (a) linear vs. (b) edge-adaptive vertical interpolation.

A hardware implementation for edge-adaptive line interpolation has been pro- 
posed [Lee 94]. Intra-frame filtering methods lead to simple hardware realizations. 
However, they are not well suited to de-interlacing in stationary regions, where spa- 
tial averaging usually causes blurring of image details, hence the need for motion- 
adaptive de-interlacing. 


Inter-Field Temporal De-Interlacing (Weave Filtering) 


The simplest de-interlacing method is merging even and odd fields of a frame, i.e., 
copying samples as shown by the horizontal arrow in Figure 6.11(a), which yields 
N/2 progressive (composite) frames from N fields, which are then replicated to
obtain N progressive frames. Composite frames provide perfect vertical resolution in 
stationary image regions, but they suffer from line-crawling artifacts in regions of fast 
motion. Field merging is called weave filtering in the computer industry. 





Figure 6.11 Motion-adaptive de-interlacing: (a) two-field three-pixel median filter and
(b) three-field filter with motion detection between the same parity fields.
Motion-Adaptive De-Interlacing


We consider two examples of motion-adaptive filtering: an implicitly motion adap- 
tive filter and one with explicit motion detection. 


Two-Field Median Filter An example of an implicit motion-adaptive filter that 
does not require motion detection is the three-point median filter, whose support is 
shown by the arrows depicted in Figure 6.11(a) [Cap 90, Haa 92] 


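$$\hat{s}(x_1, x_2, t_i) = \mathrm{Med}\left\{ s(x_1 - d/2,\; x_2 - 1,\; t_i),\; s(x_1 + d/2,\; x_2 + 1,\; t_i),\; s(x_1, x_2, t_i - 1) \right\} \tag{6.14}$$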


It is popular due to its computational simplicity and its edge-preserving property. 


Bob-and-Weave Filter In order to obtain the best performance in both moving and 
stationary regions, we consider motion-adaptive filtering, which switches between 
weaving and intra-frame interpolation [Haa 98] or linearly blends them based on a 
motion-detection function [Sch 87]. A three-field bob-and-weave filter, whose sup- 
port is depicted in Figure 6.11(b), is given by 


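$$\hat{s}(x_1, x_2, t_i) = \alpha\, s(x_1 - d/2,\; x_2 - 1,\; t_i) + \beta\, s(x_1 + d/2,\; x_2 + 1,\; t_i) + \gamma\, s(x_1, x_2, t_i - 1) \tag{6.15}$$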


where the parameters $\alpha$, $\beta$, and $\gamma$ are determined based on the value of a
motion-detection function,


$\alpha = 0.5$, $\beta = 0.5$, $\gamma = 0$ if motion is detected (Bob)
$\alpha = 0$, $\beta = 0$, $\gamma = 1$ if no motion is detected (Weave)


and the parameter $d$, estimated via Eqn. (6.11) or (6.12), enables edge-adaptive
intra-frame interpolation.

We can employ three- or four-field motion detection. Three-field motion detec- 
tion can be obtained by thresholding the difference between two fields of the same 
polarity (even-even or odd-odd) as depicted in Figure 6.11(b), whereas the four-field 
motion detection takes the logical OR of thresholded differences of the respective 
even-even and odd-odd fields. 

Motion-adaptive methods provide satisfactory results, provided that the scene 
does not contain global camera motion or fast-moving objects. In the presence of 
camera motion or fast-moving objects, MC filtering is needed for better results. 
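The switching behavior of the three-field bob-and-weave filter can be sketched as follows. This is a simplified illustration under stated assumptions: the edge displacement is fixed to $d = 0$, the filter switches hard between bob and weave instead of blending, and the threshold value and function names are hypothetical.

```python
def detect_motion(prev_field_line, next_field_line, threshold=10):
    """Three-field motion detection: threshold the difference between
    co-located pixels of two fields of the same parity."""
    return [abs(n - p) > threshold for p, n in zip(prev_field_line, next_field_line)]


def bob_and_weave_line(above, below, prev_field_line, next_field_line, threshold=10):
    """Fill one missing line of the current field.

    above, below     : existing lines of the current field around the missing line
    prev_field_line  : co-located line in the previous (opposite-parity) field
    next_field_line  : co-located line in the next field (same parity as previous)
    """
    moving = detect_motion(prev_field_line, next_field_line, threshold)
    out = []
    for a, b, p, is_moving in zip(above, below, prev_field_line, moving):
        if is_moving:
            alpha, beta, gamma = 0.5, 0.5, 0.0   # bob: intra-field line averaging
        else:
            alpha, beta, gamma = 0.0, 0.0, 1.0   # weave: copy from the previous field
        out.append(alpha * a + beta * b + gamma * p)
    return out


# First pixel is stationary (weave), second pixel changes between fields (bob).
print(bob_and_weave_line([10, 10], [30, 30], [50, 50], [50, 200]))
```

A four-field detector would, under the same assumptions, OR this decision with a second thresholded difference computed between the other pair of same-parity fields.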
Motion-Compensated De-Interlacing


Motion compensation aims to transform a video with global or local motion into 
a stationary sequence. We investigate MC filtering under two cases: i) global MC 
filtering in the case of camera motion, and ii) general MC filtering in the case of 
fast-moving objects, which typically requires a different motion trajectory at each 
pixel. 


Global-Motion-Compensated De-Interlacing 


In the case of a global camera motion or camera shake, a hybrid de-interlacing 
method can be employed, which first compensates for the dominant global motion, 
and then applies a motion-adaptive filter on this MC image to account for any 
residual motion [Pat 97a]. The block diagram of a three-field hybrid de-interlacing 
filter is depicted in Figure 6.12, where three consecutive fields are assumed to be an 
even field $E_1$, odd field $O_1$, and even field $E_2$.

In the first stage, a global-motion vector between the fields $E_1$ and $E_2$ is estimated
using the phase-correlation method over four rectangular windows that are located
near the borders of the fields, so they are most likely only affected by global motion.
Next, the fields $O_1$ and $E_2$ are motion compensated with respect to $E_1$ to generate
$O_1'$ and $E_2'$, respectively. The motion-compensation step aims to create three
consecutive fields, $E_1$, $O_1'$, and $E_2'$, where the global motion is eliminated. Subsequently,
the three-field motion-adaptive bob-and-weave filter is applied to the field sequence
$E_1$, $O_1'$, and $E_2'$. A judder post-processing step, proposed by Zaccarin and Liu [Zac
93], can also be included [Pat 97a]. Judder refers to edge-misalignment artifacts, as 
shown in Figure 6.13, caused by incorrect motion vectors. 

Judder detection may be implemented by detection of the staircase pattern 
in edge locations, which is shown in Figure 6.13(b). In the post-processing stage, 
motion vectors at the pixels where judder is detected are deemed unreliable, and the 
corresponding pixels are replaced by spatially interpolated values. 
Figure 6.12 Motion-compensated/adaptive de-interlacing.

Figure 6.13 Illustration of judder: (a) no judder and (b) judder (line crawling) present.


General-Motion-Compensated De-Interlacing 


The basic concept of MC filtering is to perform filtering along motion trajectories 
passing through missing pixels. Motion-compensated up-conversion is capable of 
producing video with higher spatio-temporal resolution than the input, and free 
from aliasing, when the original source is spatio-temporally aliased, as discussed in 
Section 6.2.1, provided that i) the motion model is accurate at least on a local basis, 
ii) the motion estimates are accurate, and iii) we do not have a critical velocity. 
The performance of an MC filter is closely related to how well we deal with inac- 
curate motion vectors, occlusions, and critical velocities. The procedure consists of 
three steps: i) true (sub-pixel) motion estimation, ii) post-processing of motion vec- 
tors, and iii) choice of filter. We presented motion estimation and post-processing 
of motion vectors for MC filtering in Section 6.1.3. In the interest of real-time 
implementation with a small number of frame stores, often a simple zero-order hold 
(weave) filter is employed along motion trajectories. Here, we present some varia- 
tions of MC zero-order hold filtering. 


Backward Extension of Motion Vectors In MC zero-order hold (field-insertion) 
filtering, spatial interpolation within the previous field is required to accommodate
sub-pixel motion vectors. In an attempt to minimize the need for spatial interpola- 
tion, Woods and Han [Woo 91] extend the search to find a match with one of the 
existing samples over two previous fields as shown in Figure 6.14. The lines indicate 
possible motion vectors (MV) and the pixel to be copied for each MV. For MVs that 
do not point to the proximity of an existing pixel location in either of the previous 
two fields, the average of the two nearest existing pixels in the previous two fields is 
computed as the interpolated value. 


Time-Recursive De-Interlacing Time-recursive (TR) filters use previously de-interlaced
frames (instead of input fields only) in MC de-interlacing [Wan 90]. Since
interpolated samples depend on previous original samples as well as previously
interpolated samples, TR filters may cause propagation of interpolation errors into
subsequent frames. MC median filtering, rather than simple MC weaving, has been
proposed to prevent error propagation.

Figure 6.14 Zero-order hold filtering extended to two previous fields.


Hybrid Methods Successful and robust de-interlacing often requires adaptively 
combining MC and motion-adaptive methods, such as intra-line averaging (bob), 
edge-adaptive intra-averaging, field insertion (weave), and MC forward or backward 
field insertion and MC field averaging [Haa 98]. The fundamental difficulty with 
hybrid methods is to develop a systematic approach for reliable quality ranking of 
individual methods for proper switching. 


6.2.3 Frame-Rate Conversion 


Frame-rate conversion refers to presenting video at a temporal rate different from the
one at which it was shot. The most common examples are displaying 50i/60i broadcast TV
on 100/120 Hz interlaced or progressive panels and broadcasting 24 Hz movies on 
TV in a 50/60 Hz country. Frame-rate conversion techniques can range from simple 
frame replication (pull down) to very sophisticated motion compensated interpola- 
tion (synthesis) techniques. 


24 Hz Movies to 50/60 Hz 


Movies are recorded at 24 frames/sec or 24 Hz. In most theatres, they are projected
at 72 Hz via a shutter showing the same frame three times. The Blu-ray disc
specification requires 23.976 Hz 1920 × 1080 video. TV standards vary according to
geographical location. The UK standardized a TV frame rate of 50 Hz (fields/sec),
or 25 interlaced frames/sec, in the 1930s, while the TV frame rate in the United
States is 60 (59.94) Hz or 30 (29.97) interlaced frames/sec. Pull-down methods are
frame-repetition methods that insert additional frames in various patterns so the
duration of 24 Hz (23.976 Hz) movies is unchanged when shown on 50/60 Hz TV
broadcasts.

Figure 6.15 Illustration of various pull-down methods.

In 50 Hz European TV, 24 Hz movies are simply played back as if they were
25 pictures/sec by splitting each picture into two fields. This is called 2:2 pull-down,
which is illustrated in Figure 6.15. Playback then runs 25/24 times faster, so the
duration of the movie is reduced by about 4%, but we do not notice this. Note that the
audio pitch is shifted accordingly.
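As a quick worked example of the 2:2 pull-down speed-up (the 120-minute running time is an arbitrary illustration, and the pitch figure assumes the audio is simply played faster without any pitch correction):

$$\frac{25}{24} \approx 1.0417, \qquad T_{\mathrm{TV}} = \frac{24}{25}\, T_{\mathrm{film}} = 0.96 \times 120\ \text{min} = 115.2\ \text{min}$$

$$\Delta_{\mathrm{pitch}} = 12 \log_2 \frac{25}{24} \approx 0.71\ \text{semitones}$$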

In the 60 Hz TV world speeding up a 24 Hz source to 60 Hz would be visible; 
therefore, 3:2 or 2:2:2:4 pull-down methods have been invented. In the 3:2 pull- 
down, each odd frame of digital motion picture is repeated three times and each 
even frame is repeated twice, or vice versa. These pictures are then interlaced yield- 
ing a 60 Hz field rate from a 24 Hz input source. Alternatively, 2:2:2:4 pull-down 
repeats the first three frames twice and the fourth frame four times periodically, 
before interlacing each picture. These repetition patterns are depicted in Figure 6.15. 
The pull-down methods introduce temporal aliasing that results in jerky motion ren- 
dition. Such motion artifacts may be visible with bigger displays and high-resolution 
video formats, where more sophisticated motion-adaptive or MC frame/field-rate 
conversion algorithms may be needed. 
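The repetition patterns described above can be generated with a short sketch. The function name and the representation of a field as a `(frame, parity)` pair are assumptions made for illustration only; the code merely repeats frames according to a cyclic pattern and alternates field parity.

```python
def pull_down(frames, pattern):
    """Expand progressive frames into a field sequence.

    frames  : iterable of frame labels (e.g., "ABCD")
    pattern : cyclic list of repetition counts, e.g. [3, 2], [2, 2, 2, 4], or [2, 2]
    Returns a list of (frame, parity) pairs with alternating field parity.
    """
    fields = []
    for idx, frame in enumerate(frames):
        repeats = pattern[idx % len(pattern)]
        for _ in range(repeats):
            parity = "even" if len(fields) % 2 == 0 else "odd"
            fields.append((frame, parity))
    return fields


# 3:2 pull-down: 4 film frames become 10 fields, so 24 frames/sec become 60 fields/sec.
print("".join(f for f, _ in pull_down("ABCD", [3, 2])))        # AAABBCCCDD
# 2:2:2:4 pull-down reaches the same field count with a different pattern.
print("".join(f for f, _ in pull_down("ABCD", [2, 2, 2, 4])))  # AABBCCDDDD
print(len(pull_down(range(24), [3, 2])))                       # 60
```

Both patterns map one second of film to exactly 60 fields, whereas the 2:2 pattern yields only 48 fields per 24 frames, which is why 50 Hz playback speeds the movie up instead.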

The inverse pull-down methods can be used to recover a 24 Hz progressive
source from a 60 fields/sec interlaced video that was generated from motion pictures
using 3:2 or 2:2:2:4 pull-down methods. The inverse pull-down methods have been
adopted by the Moving Picture Experts Group (MPEG) as a pre-processing step for
redundancy reduction in compression of 60 Hz interlaced video that was generated from
motion pictures using the 3:2 pull-down method.

Figure 6.16 50 to 60 Hz conversion by weighted averaging (blending) of frames.


50 to 60 Hz Conversion 


Conversion from 50 Hz to 60 Hz requires replicating a picture (an even and an 
odd field) every five pictures. Conversion from 60 Hz to 50 Hz may be achieved by 
dropping a picture every six frames. Smoother results can be obtained by weighted 
averaging interpolation (blending) in time (shown in Figure 6.16), rather than frame 
dropping or replicating, to achieve the desired field rate. 
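A minimal sketch of temporal blending for 50 to 60 Hz conversion is given below. It assumes weights that vary linearly with the temporal distance to the two nearest input pictures; the exact weights used in Figure 6.16 are not reproduced here, pictures are reduced to single numbers for brevity, and the function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def blend_weights(m, rate_in=50, rate_out=60):
    """Return (i0, w0, w1) so that output picture m = w0 * in[i0] + w1 * in[i0 + 1]."""
    pos = Fraction(m * rate_in, rate_out)  # output instant measured in input pictures
    i0 = floor(pos)
    w1 = pos - i0                          # weight grows as we approach in[i0 + 1]
    return i0, 1 - w1, w1


def convert(pictures, rate_in=50, rate_out=60):
    """Resample a list of pictures (here: single numbers) by linear blending in time."""
    out = []
    m = 0
    last = len(pictures) - 1
    while Fraction(m * rate_in, rate_out) <= last:
        i0, w0, w1 = blend_weights(m, rate_in, rate_out)
        i1 = min(i0 + 1, last)             # replicate the final picture at the border
        out.append(w0 * pictures[i0] + w1 * pictures[i1])
        m += 1
    return out


# Six input pictures (five 1/50-sec intervals) yield seven output pictures (six 1/60-sec intervals).
print([float(v) for v in convert([0, 6, 12, 18, 24, 30])])
```

With this choice, every sixth output picture coincides with an input picture (weight 1), and the pictures in between are mixtures, which trades the judder of frame replication for some blur in moving areas.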


Scan-Rate Doubling 


Doubling of the field rate (also called the scan rate) has been adopted by most TV 
manufacturers of 100 Hz receivers to improve visual quality. In digital TV receivers, 
it is easy to replicate each field twice to achieve scan-rate doubling. There exists more 
than one way to repeat the fields. For example, an odd field may be repeated to form 
the next even field, and an even field is repeated to form the next odd field. This 
method has reasonably good performance in moving scenes, but poor results in sta- 
tionary regions are inevitable. Alternatively, one can repeat an even field to form the 
next even field, and an odd field to form the next odd field. This strategy is optimal 
for stationary scenes but fails in moving regions. 

A linear shift-invariant inter-frame filtering strategy for frame/field rate up-con- 
version would be frame/field averaging, where a missing frame is replaced by the 
average of its two neighboring frames/fields. Averaging improves the SNR in stationary
regions; hence, it is superior to simple frame/field repetition. However, it may
introduce ghost artifacts in moving regions. 
Figure 6.17 Interlace to interlace scan-rate conversion.

Figure 6.18 Dark circles indicate lines existing in the input odd and even fields. Even fields
e* are interpolated by the three-point filter. Light-color circles indicate output odd field
lines interpolated from the input even field lines by intra-frame line averaging.
Even lines indicated by dark circles in the field o* are discarded.


A simple scan-rate conversion method that preserves temporal sequencing of 
the fields is sequential intra- and inter-field line averaging, which is illustrated in 
Figure 6.17 and Figure 6.18. Only input odd fields (marked o) are retained “as is” 
in the 100 Hz output sequence. The input even fields are converted to odd fields 
(marked o*) in the output video by intra-field line averaging. The interpolated odd 
field lines are shown by lighter circles in Figure 6.18. New even fields (marked e*) are
interpolated between o and o* by inter-field filtering (blending). A three-point average
or median filter, shown by the arrows in Figure 6.18, is used for interpolation of
even fields. The three-point filter uses existing lines in the input odd and even fields. 

Intra-field line-averaging performs reasonably well in moving regions. However,
it introduces blurring in stationary regions because it is a spatial filter. Notice
that field repetition and inter-field averaging are optimal (in noise-free and noisy
cases, respectively) in stationary regions. However, while field-repetition algorithms
yield jagged edges, inter-field averaging may introduce ghost artifacts in moving
regions. Obviously, non-adaptive, non-MC algorithms cannot be perfect for both
stationary and moving regions, which suggests the need for motion-adaptive or
MC filtering.


Motion-Adaptive Scan-Rate Conversion 


One way to avoid blurring in stationary regions and ghosting in moving regions is 
to employ an adaptive filter where the filter impulse response is determined locally 
based on a motion-detection function. For example, we can replicate the same parity 
fields in stationary regions and perform three-point weighted averaging in moving 
regions. The boundaries of the moving regions can be estimated by using a motion- 
detection function, which may simply be the frame difference as in change detec- 
tion. Because no optimal strategy exists to determine the filter weights in terms of 
the motion-detection function in moving areas, several researchers have suggested 
the use of spatio-temporal median filtering. Median filtering is known to be edge- 
preserving in still-frame image processing. Considering the effect of motion as a 
temporal edge, spatio-temporal median filtering should provide motion adaptivity. 
In scan-rate up-conversion, we employ a three-point spatio-temporal median
filter described by

$$\hat{s}(x_1, x_2, t_{e^*}) = \mathrm{Med}\left\{ s(x_1, x_2 - 1, t_o),\; s(x_1, x_2 + 1, t_o),\; s(x_1, x_2, t_e) \right\} \tag{6.16}$$


where “Med” denotes the median operation. The three pixels within the support 
of the filter are shown in Figure 6.18 for even-field estimation. The median filter 
is motion-adaptive such that in moving regions the filter output generally comes 
from one of the two pixels that is 1/2 pixel above or below the pixel to be estimated, 
and in the stationary regions it comes from the pixel that is in the same vertical 
position. Thus, it yields reasonable performance in both stationary and moving 
regions. The three-point median filter has been used in improved-definition TV 
receivers for field-rate doubling. Several modifications including a combination of 
averaging and median filters have also been proposed for performance improve- 
ments [Haa 92]. 

In median filtering, each sample in the filter support is given an equal emphasis. 
Alternatively, one can employ weighted median filtering, given by 


$$\hat{s}(x_1, x_2, t_{e^*}) = \mathrm{Med}\left\{ w_1 \diamond s(x_1, x_2 - 1, t_o),\; w_2 \diamond s(x_1, x_2 + 1, t_o),\; w_3 \diamond s(x_1, x_2, t_e) \right\}$$

where $w_i$ denotes the weight of the $i$th sample and $\diamond$ is the replication operator. That
is, each sample $i$ is replicated $w_i$ times to affect the output of the median operation.
To obtain the best possible results, the weights should be chosen as a function of a 
motion-detection signal. An example of a motion-detection signal for scan-rate up- 
conversion and weight combination as a function of the motion-detection signal is 
given by Haavisto and Neuvo [Haa 92]. 
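The three-point median and its weighted variant can be sketched as follows. The sample values and weights are made up for illustration, the function names are hypothetical, and the weights are assumed to be positive integers whose sum is odd so that the median is a single sample.

```python
def median(values):
    """Median of an odd number of samples (middle element after sorting)."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def weighted_median(samples, weights):
    """Replicate each sample weight-many times, then take the ordinary median."""
    replicated = []
    for s, w in zip(samples, weights):
        replicated.extend([s] * w)
    return median(replicated)


# Samples: pixel above, pixel below, temporally neighboring pixel at the same position.
print(median([10, 30, 22]))    # stationary detail: the temporal sample is selected (22)
print(median([10, 30, 200]))   # motion: a spatial neighbor is selected instead (30)

# Emphasizing the temporal sample makes the filter follow it even when it is an outlier.
print(weighted_median([10, 30, 200], [1, 1, 3]))  # 200
```

The last line illustrates why the weights should depend on a motion-detection signal: a large temporal weight is appropriate only where the scene is known to be stationary.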


Motion-Compensated Frame/Scan-Rate Conversion 


If frame/field duplication/deletion or simple linear averaging (blending) of frames 
or median filtering do not produce satisfactory results, new frames need to be syn- 
thesized by MC interpolation. There is usually a trade-off between the accuracy of 
spatial-image details (reusing an original frame) and spatio-temporal accuracy (posi- 
tions of objects in a frame at a given time according to motion trajectories). If we 
reuse an existing frame at a slightly different time than it actually belongs to, spatial 
details will be well-preserved but positional inaccuracy may lead to motion jitter. If 
we synthesize a new frame for the exact time position then spatial position of objects 
will be correct (assuming true motion estimation) but some spatial details may be 
blurred due to interpolation errors. A simple method for frame synthesis is MC 
blending (i.e., averaging of pixels from neighbor frames along the motion trajec- 
tory passing through the center pixel). Low-cost digital frame buffers and real-time 
motion estimation hardware/software have made MC filtering practical since the 
late 1990s. Commercial and non-commercial implementations of these algorithms 
are available with good results. ASIC implementations are built into high frame-rate 
TVs under different brand names. 

An MC frame/field rate conversion system consists of several sub-blocks [Bar 
10], which are shown in Figure 6.19. We note that if the source video is interlaced, 
de-interlacing is applied prior to MC filtering for more accurate motion estimation 
and frame rendering even if the desired output is interlaced. Motion-estimation and 
post-processing methods were discussed in Section 6.1.3 and de-interlacing methods 
were covered in Section 6.2.2. Frame/field replication or three-point median filtering 
is often used as a fallback mode. 
Figure 6.19 Block diagram of an MC scan-rate conversion system.

Film-Mode Detection


If the input video has been converted to 50/60 Hz from an original 24 Hz movie
by some pull-down method, this can be detected from successive frame similarity
patterns, and the input video is converted back to 24 Hz progressive format prior to
MC filtering.


6.3 Multi-Frame Noise Filtering 


Fundamental concepts that make de-noising possible are self-similarity in the spatio- 
temporal domain and sparsity in a transform domain. Video contains significant 
self-similarity (redundancy) in the temporal dimension, which implies sparsity in 
transform domain (e.g., planar support in the Fourier domain for global translational 
motion as described in Section 6.1). As a result, inter-frame (multi-frame) noise 
filters can provide much better noise reduction than intra-frame filters. Motion- 
adaptive or MC filters can remove noise while avoiding spatial blurring of image 
detail [Dub 84, Sez 91, Boy 92, Liu 10, Mag 12]. 


6.3.1 Motion-Adaptive Noise Filtering 


Motion-adaptive noise filters do not perform explicit motion estimation. They are 
applied over a fixed spatio-temporal support at each pixel. We start this section by 
discussing direct filtering where there is no adaptivity at all, or the adaptivity is 
implicit in the filter design. Next, we discuss filter structures, where some coefficients 
vary as a function of a so-called “motion-detection” signal. 


Direct or Implicitly Motion-Adaptive Temporal Filtering 


The simplest form of direct temporal filtering is frame averaging, where we average 
pixels occupying the same spatial coordinates in consecutive frames. Direct temporal 
averaging provides good results in the stationary parts of a frame, because averaging 
multiple observations of essentially the same pixel in different frames eliminates noise 
while preserving image detail. It is well known that direct averaging corresponds to 
maximum-likelihood estimation when there is no motion, assuming white, Gauss- 
ian noise, and reduces the variance of the noise by a factor of N, where N is the 
number of samples [Uns 90]. It follows that in pure temporal filtering a large num- 
ber of frames may be needed for noise reduction depending on the SNR. Spatio- 
temporal filtering provides a compromise between the number of frames needed for 
noise reduction and the amount of spatial blurring. Direct temporal averaging is not 
motion-adaptive; hence, it causes smearing and chrominance separation in moving
areas in the same way that direct spatial averaging leads to blurring around edges. 
A fundamental question that arises is how to distinguish temporal variations due to 
motion from those due to noise, which requires modeling motion. 
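The factor-of-$N$ variance reduction quoted above follows from a short calculation. As a sketch, assume a stationary pixel observed as $g_k = s + v_k$, $k = 1, \ldots, N$, where the noise samples $v_k$ are zero-mean, mutually uncorrelated, and have common variance $\sigma_v^2$ (the symbols $g_k$, $v_k$, and $\sigma_v^2$ are introduced here only for this illustration):

$$\hat{s} = \frac{1}{N} \sum_{k=1}^{N} g_k, \qquad E[\hat{s}] = s, \qquad \mathrm{Var}(\hat{s}) = \frac{1}{N^2} \sum_{k=1}^{N} \mathrm{Var}(v_k) = \frac{\sigma_v^2}{N}$$

The corresponding SNR gain is $10 \log_{10} N$ dB, e.g., about 6 dB for $N = 4$ frames, so each additional 6 dB of noise reduction requires four times as many frames.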

Motion-adaptive filtering is the temporal counterpart of edge-preserving spa- 
tial filtering in that frame-to-frame motion gives rise to temporal edges. It follows 
that spatio-temporal noise filters that adapt to motion can be obtained by using 
structures similar to those of the edge-preserving filters. Implicitly motion-adaptive 
filters include directional filters and order statistic filters, such as median, weighted 
median, and multi-stage median filters [Arc 91]. For example, Martinez and Lim 
[Mar 85] proposed a cascade of five 1D FIR linear minimum mean-square error 
(LMMSE) estimators over a set of five hypothesized motion trajectories at each pixel 
that correspond to no motion, motion in the $+x_1$ direction, motion in the $-x_1$
direction, motion in the $+x_2$ direction, and motion in the $-x_2$ direction. Due to the
adaptive nature of the LMMSE estimator, filtering is effective only along
hypothesized trajectories that are close to actual ones.


Motion-Detection Based Filtering 


In motion-detection based filters, the selected filter structure has parameters that can 
be tuned according to a motion-detection signal, such as the frame difference. Both 
FIR and IIR filter structures can be employed in motion-adaptive filtering. A simple 
example of a motion-adaptive FIR filter is given by 


$$\hat{s}[n_1, n_2, k] = (1 - \gamma)\, g[n_1, n_2, k] + \gamma\, g[n_1, n_2, k-1] \tag{6.17}$$


and that of an IIR filter by 


$$\hat{s}[n_1, n_2, k] = (1 - \gamma)\, g[n_1, n_2, k] + \gamma\, \hat{s}[n_1, n_2, k-1] \tag{6.18}$$
where 


$$\gamma = \max\left\{ 0,\; \frac{1}{2} - \alpha \left| g[n_1, n_2, k] - g[n_1, n_2, k-1] \right| \right\} \tag{6.19}$$


is the motion-detection signal and $\alpha$ is a scaling constant. Observe that these filters
tend to turn off filtering when a large motion is detected in an attempt to prevent 
artifacts. The FIR structure has limited noise-reduction ability, especially when used 
as a purely temporal filter with a small number of frames, because the reduction in 
the noise variance is proportional to the number of samples in the filter support. IIR
filters are more effective, but they generally cause Fourier-phase distortions. 

Implementations of noise filters based on the above structures generally differ in 
the way they compute the motion-detection signal [Den 80]. 
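The two motion-detection based structures can be sketched per pixel as below. This is an illustrative sketch only: frames are represented as flat lists of pixel values, the first frame is passed through unfiltered, and the value of the scaling constant `alpha` is a hypothetical choice.

```python
def motion_gain(g_cur, g_prev, alpha):
    """Motion-detection signal: 0.5 for identical pixels, falling to 0 for large differences."""
    return max(0.0, 0.5 - alpha * abs(g_cur - g_prev))


def fir_filter(frames, alpha=0.01):
    """Two-tap temporal FIR filter: blends each pixel with the previous noisy frame."""
    out = [list(frames[0])]
    for k in range(1, len(frames)):
        row = []
        for cur, prev in zip(frames[k], frames[k - 1]):
            gamma = motion_gain(cur, prev, alpha)
            row.append((1 - gamma) * cur + gamma * prev)
        out.append(row)
    return out


def iir_filter(frames, alpha=0.01):
    """Recursive variant: blends each pixel with the previous filtered estimate."""
    out = [list(frames[0])]
    for k in range(1, len(frames)):
        row = []
        for cur, prev, prev_est in zip(frames[k], frames[k - 1], out[k - 1]):
            gamma = motion_gain(cur, prev, alpha)
            row.append((1 - gamma) * cur + gamma * prev_est)
        out.append(row)
    return out


print(fir_filter([[100], [104]]))         # small difference: strong temporal smoothing
print(fir_filter([[100], [200]]))         # large difference: filtering is switched off
print(iir_filter([[100], [104], [100]]))  # recursion reuses the earlier estimate
```

Because the recursive form feeds its own output back, it averages over more past frames than the two-tap FIR form for the same memory cost.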


6.3.2 Motion-Compensated Noise Filtering 


The MC filtering approach is based on the assumption that the variation of the pixel
gray levels over any motion trajectory $c(\tau; x_1, x_2, t)$ is due mainly to noise. The
motion trajectory $c(\tau; x_1, x_2, t)$ is a continuous-valued vector function returning the
coordinates of a point at time $\tau$ that corresponds to the point $(x_1, x_2)$ at time $t$. Thus,
noise in both the stationary and moving areas of the image can effectively be reduced
by low-pass filtering over the respective motion trajectory at each pixel. MC filters
differ according to: i) the motion estimation method, ii) the support of the filter
(e.g., temporal vs. spatio-temporal), and iii) the filter structure (e.g., FIR vs. IIR,
adaptive vs. non-adaptive).

The concept and estimation of a motion trajectory are illustrated in Figure 6.20.
Suppose we wish to filter frame $k$ using $N$ frames centered about frame $k$, given by
$k-M, \ldots, k-1, k, k+1, \ldots, k+M$, where $N = 2M+1$. The first step is to estimate
a discrete-motion trajectory $c(l; n_1, n_2, k)$, $l = k-M, \ldots, k-1, k, k+1, \ldots, k+M$,
at each pixel $(n_1, n_2)$ of frame $k$. The function $c(l; n_1, n_2, k)$ is a continuous-valued
vector function returning the $(x_1, x_2)$ coordinates of a point in frame $l$, which
corresponds to the pixel $(n_1, n_2)$ of the $k$th frame. The discrete-motion trajectory is
depicted by the solid line in Figure 6.20 for the case of $N = 5$ frames. In estimating
the trajectory, the displacement vectors are usually estimated in reference to the
frame $k$ as indicated by the dotted lines. The trajectory in general passes through
sub-pixel locations, where the intensities can be determined via sub-pixel interpolation.
The support $S_{n_1, n_2, k}$ of an MC spatio-temporal filter is defined as the union of
pre-determined spatial neighborhoods (e.g., 3 × 3 regions) centered about the pixel
(sub-pixel) locations along the motion trajectory. In temporal filtering, the filter
support $S_{n_1, n_2, k}$ coincides with the motion trajectory $c(l; n_1, n_2, k)$. Clearly, the
effectiveness of MC spatio-temporal filtering is related to the accuracy of the motion
estimates.

Figure 6.20 Estimation of the motion trajectory (M=2, N=5). Solid line: motion trajectory
at $(n_1, n_2, k)$; dashed arrows: motion estimation between frames.

Various filtering techniques, ranging from averaging to more sophisticated adap- 
tive filtering, can be employed given the MC filter support. In the ideal case, where 
the motion estimation is perfect, direct averaging of image intensities along a motion 
trajectory provides effective noise reduction [Mar 85]. In practice, motion estima- 
tion is hardly ever perfect due to noise, occlusions, and sudden scene changes, as well 
as changing camera views. As a result, image intensities over an estimated motion 
trajectory may not necessarily correspond to the same image structure, and temporal 
averaging may yield artifacts, hence the need for adaptive filter structures over the 
MC filter support. 

We review two early adaptive MC filters: the MC-LMMSE filter, which turns 
filtering off whenever non-uniformity is detected within the MC support, and the 
MC adaptive weighted averaging (AWA) filter, which is a variation of bi-lateral fil- 
tering that weighs down the effect of outliers causing the non-uniformity. We also 
review two more recent MC filters that are extensions of the non-local means filter 
and BM3D filter studied in Chapter 3 for MC video de-noising. 


Spatio-Temporal Adaptive LMMSE Filtering 


The MC adaptive LMMSE filter [Sam 85, Sez 91, Ozk 93] is an extension of the edge- 
preserving spatial filter that was presented in Section 3.5.3 for the spatio-temporal 
domain, where the local spatial statistics are replaced by their spatio-temporal
counterparts. Then, the estimate of the pixel value at $(n_1, n_2, k)$ is given by


$$\hat{s}[n_1, n_2, k] = \mu_s[n_1, n_2, k] + \frac{\sigma_s^2(n_1, n_2, k)}{\sigma_s^2(n_1, n_2, k) + \sigma_v^2} \left( g[n_1, n_2, k] - \mu_g[n_1, n_2, k] \right) \tag{6.20}$$

where $\mu_s(n_1, n_2, k)$ and $\sigma_s^2(n_1, n_2, k)$ denote the ensemble mean and variance of the
corresponding signal, respectively. Depending on whether we use spatio-temporal or
temporal statistics, this filter will be referred to as the LMMSE-ST or the LMMSE-T
filter, respectively.
We assume that signal-dependent noise can be modeled as

$$v(n_1, n_2, k) = s^{\alpha}(n_1, n_2, k)\, u(n_1, n_2, k)$$

where $u(n_1, n_2, k)$ is a wide-sense stationary process with zero-mean, which is
independent of the signal, and $\alpha$ is a real number. The film grain noise is commonly
modeled with $\alpha = 1/3$, and for signal-independent noise we have $\alpha = 0$. Under this
model, $\mu_s(n_1, n_2, k) = \mu_g(n_1, n_2, k)$ and $\sigma_s^2(n_1, n_2, k) = \sigma_g^2(n_1, n_2, k) - \sigma_v^2$, where $\sigma_v^2$
denotes the variance of the noise process.

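In practice, the ensemble mean $\mu_g(n_1, n_2, k)$ and variance $\sigma_g^2(n_1, n_2, k)$ are
replaced with the sample mean $\hat{\mu}_g(n_1, n_2, k)$ and variance $\hat{\sigma}_g^2(n_1, n_2, k)$, which are
computed within the support $S_{n_1, n_2, k}$ as


$$\hat{\mu}_g(n_1, n_2, k) = \frac{1}{L} \sum_{(i_1, i_2, l) \in S_{n_1, n_2, k}} g[i_1, i_2, l] \tag{6.21}$$

and

$$\hat{\sigma}_g^2(n_1, n_2, k) = \frac{1}{L} \sum_{(i_1, i_2, l) \in S_{n_1, n_2, k}} \left( g[i_1, i_2, l] - \hat{\mu}_g(n_1, n_2, k) \right)^2 \tag{6.22}$$

where $L$ is the number of pixels in $S_{n_1, n_2, k}$. Then

$$\hat{\sigma}_s^2(n_1, n_2, k) = \max\left\{ \hat{\sigma}_g^2(n_1, n_2, k) - \sigma_v^2,\; 0 \right\} \tag{6.23}$$


in order to avoid the possibility of a negative variance estimate.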
Substituting these estimates into the filter expression (6.20), we have 


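$$\hat{s}[n_1, n_2, k] = \frac{\hat{\sigma}_s^2(n_1, n_2, k)}{\hat{\sigma}_s^2(n_1, n_2, k) + \sigma_v^2}\, g[n_1, n_2, k] + \frac{\sigma_v^2}{\hat{\sigma}_s^2(n_1, n_2, k) + \sigma_v^2}\, \hat{\mu}_g[n_1, n_2, k] \tag{6.24}$$
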
The adaptive nature of the filter can be observed from (6.24). When the
spatio-temporal signal variance is much smaller than the noise variance,
$\hat{\sigma}_s^2(n_1, n_2, k) \approx 0$, i.e., the support $S_{n_1, n_2, k}$ is uniform, the estimate
approaches the spatio-temporal mean, $\hat{\mu}_g(n_1, n_2, k)$. At the other extreme, when the
spatio-temporal signal variance is much larger than the noise variance,
$\hat{\sigma}_s^2(n_1, n_2, k) \gg \sigma_v^2$, due to poor motion estimation or the presence of sharp
spatial edges in $S_{n_1, n_2, k}$, the estimate approaches the noisy image value to avoid
blurring.

A drawback of the adaptive LMMSE filter is that it turns the filtering down even
if there are a few outlier pixels in the filter support, thus leaving noise in the filtered
image. An alternative implementation, called “the switched LMMSE filter,” may
maintain the advantages of both the spatio-temporal and the temporal LMMSE
filtering by switching between a selected set of temporal and spatio-temporal
supports at each pixel, depending on which support is the most uniform [Ozk 93]. If
the variance $\hat{\sigma}_g^2(n_1, n_2, k)$ computed over $S_{n_1, n_2, k}$ is less than the noise variance,
then the filtering is performed using this support; otherwise, the largest support over
which $\hat{\sigma}_g^2(n_1, n_2, k)$ is less than the noise variance is selected. In the next section,
we introduce an adaptive weighted averaging (AWA) filter that employs an implicit
mechanism for selecting the most uniform subset within $S_{n_1, n_2, k}$ for filtering.
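The behavior of the adaptive LMMSE estimate at its two extremes can be checked with a small sketch. It assumes that the motion-compensated support has already been gathered into a flat list of pixel values and that the noise variance is known and positive; the function name and the sample values are hypothetical.

```python
def lmmse_estimate(g_center, support, noise_var):
    """Adaptive LMMSE estimate of one pixel from its (motion-compensated) support.

    g_center  : noisy value of the pixel being filtered
    support   : noisy values of all pixels in the filter support
    noise_var : noise variance, assumed known and > 0
    """
    L = len(support)
    mean_g = sum(support) / L
    var_g = sum((v - mean_g) ** 2 for v in support) / L
    var_s = max(var_g - noise_var, 0.0)   # clip to avoid a negative variance estimate
    w = var_s / (var_s + noise_var)       # weight given to the noisy observation
    return w * g_center + (1.0 - w) * mean_g


# Nearly uniform support: sample variance is below the noise variance, output is the mean.
print(lmmse_estimate(11, [9, 11, 10, 10, 10], noise_var=1.0))
# Support straddling a sharp edge: the estimate stays close to the noisy pixel value.
print(lmmse_estimate(100, [0, 0, 100, 100], noise_var=1.0))
```

The second case also illustrates the drawback noted above: once the support is non-uniform, almost no averaging takes place, so the noise at that pixel is left largely untouched.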


Adaptive-Weighted-Averaging Filter 

The adaptive-weighted-averaging (AWA) filter is a variation of the bi-lateral or sigma 
filter, which computes a weighted average of the intensity values within the spatio- 
temporal support along the motion trajectory. The weights are determined by opti- 
mizing a criterion functional, and they vary with the accuracy of motion estimation 
as well as the spatial uniformity of the region around the motion trajectory. In the 
case of sufficiently accurate motion estimation across the entire trajectory and spa- 
tial uniformity, image values within the spatio-temporal filter support attain equal 
weights, and the AWA filter performs direct spatio-temporal averaging. When the 
value of a pixel within the spatio-temporal filter support deviates from the value of 
the pixel to be filtered by more than a threshold, its weight decreases, shifting the 
emphasis to other pixels within the support that better match the pixel of interest. 
The AWA filter is therefore particularly well suited for efficient filtering of sequences 
containing segments with varying scene contents due, for example, to rapid zooming 
and changes in the view of the camera. 


The AWA filter can be defined by 


$$\hat{s}[n_1,n_2,k] = \sum_{(i_1,i_2,\ell)\in S_{n_1,n_2,k}} w(i_1,i_2,\ell)\, g[i_1,i_2,\ell] \tag{6.25}$$

where

$$w(i_1,i_2,\ell) = \frac{K(n_1,n_2,k)}{1 + a \max\left\{\epsilon^2,\ (g[n_1,n_2,k]-g[i_1,i_2,\ell])^2\right\}}$$

are the weights within the support $S_{n_1,n_2,k}$ along the motion trajectory and
$K(n_1,n_2,k)$ is a normalization constant, given by

$$K(n_1,n_2,k) = \left( \sum_{(i_1,i_2,\ell)\in S_{n_1,n_2,k}} \frac{1}{1 + a \max\left\{\epsilon^2,\ (g[n_1,n_2,k]-g[i_1,i_2,\ell])^2\right\}} \right)^{-1}$$

The quantities $a>0$ and $\epsilon$ are the parameters of the filter. These parameters are
determined according to the following principles: 


1. When the differences in the intensities of pixels within the spatio-temporal
support are merely due to noise, it is desirable that the weighted averaging
reduces to direct averaging. This can be achieved by selecting the parameter $\epsilon^2$
appropriately. Note that if the square of the differences is less than $\epsilon^2$, then all
the weights attain the same value $K/(1+a\epsilon^2)=1/L$, where $L$ is the number of
pixels in the support, and $\hat{s}[n_1,n_2,k]$ reduces to direct averaging. We set the value
of $\epsilon^2$ equal to two times the value of the noise variance, i.e., the expected value
of the square of the difference between two pixel values that differ due only to the
presence of noise.

2. If the square of the difference between the values $g[n_1,n_2,k]$ and $g[i_1,i_2,\ell]$
for a particular $(i_1,i_2,\ell)\in S_{n_1,n_2,k}$ is larger than $\epsilon^2$, then the contribution of
$g[i_1,i_2,\ell]$ is weighted down by $w(i_1,i_2,\ell)<w(n_1,n_2,k)=K/(1+a\epsilon^2)$. The
“penalty” parameter $a$ determines the sensitivity of the weight to the squared
difference $(g[n_1,n_2,k]-g[i_1,i_2,\ell])^2$. It is usually set equal to unity.


The effect of the penalty parameter $a$ on the performance of the AWA filter can
be best visualized considering a special case where one of the frames within the filter
support is substantially different from the rest. In the extreme, when $a=0$, all weights
are equal. That is, there is no penalty for a mismatch, and the AWA filter performs
direct averaging. However, for large $a$, the weights for the $2M$ “matching” frames are
equal, whereas the weight of the non-matching frame approaches zero. Generally
speaking, the AWA filter takes the form of a “limited-amplitude averager,” where
those pixels whose intensities differ from that of the center pixel by no more than $\pm\epsilon$
are averaged. A similar algorithm is K-nearest neighbor averaging [Dav 78], where
the average of the K pixels within a certain window whose values are closest to the value
of the pixel of interest is computed.
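
The weight computation in (6.25) can be sketched as follows. The code operates on a flat array of the observed values already gathered along the motion trajectory; how those values are collected (motion estimation, support shape) is outside this sketch. The function names and the default `a=1.0` follow the text's suggestion of unity but are otherwise illustrative assumptions.

```python
import numpy as np

def awa_weights(center, values, noise_var, a=1.0):
    """Normalized AWA weights for the pixels in one spatio-temporal support.

    center    : observed value g[n1, n2, k] of the pixel being filtered
    values    : 1D array of observed values g[i1, i2, l] in the support
    noise_var : noise variance; epsilon^2 is set to twice this value
    a         : penalty parameter (a > 0)
    """
    eps2 = 2.0 * noise_var
    sq_diff = (center - np.asarray(values, dtype=float)) ** 2
    raw = 1.0 / (1.0 + a * np.maximum(eps2, sq_diff))
    return raw / raw.sum()          # division by the sum plays the role of K

def awa_filter(center, values, noise_var, a=1.0):
    """Weighted average of the support values, as in (6.25)."""
    w = awa_weights(center, values, noise_var, a)
    return float(np.dot(w, np.asarray(values, dtype=float)))
```

When every squared difference is below $\epsilon^2$ the weights are all equal and the filter is a plain average; a single strongly mismatched value receives a weight close to zero and barely affects the result.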

The noise variance appears directly in the LMMSE filter expression, whereas 
in the AWA filter, it is used to define the filter parameter $\epsilon^2$, which is typically set
equal to twice the estimated noise variance. Inspection of the results suggests that the 
spatio-temporal filters provide better noise reduction than the respective temporal 
filters, at the expense of introducing some blur. The switched LMMSE filter strikes 
the best balance between noise reduction and retaining image sharpness. The visual 
difference between the results of the LMMSE and AWA filters is insignificant if there 
are no noise outliers or sudden scene changes involved. It has been shown that the 
AWA filter outperforms the LMMSE filter, especially in cases of low-input SNR and 
abruptly varying scene content [Ozk 93].


Temporally Coherent NLM Filtering


Non-local means (NLM) filtering for image de-noising (see Chapter 3) has been 
extended to temporally coherent filtering by exploiting self-similarity in the spatial and 
temporal dimensions [Liu 10]. The authors introduce approximate K-nearest neigh- 
bor patch matching to find a set of supporting patches from the current frame and 
temporally adjacent frames that are similar to the current patch. The proposed search 
method has lower search complexity to allow for searching for similar patches over the 
entire frame. They also estimate the noise level at each frame for noise-adaptive de- 
noising. The authors argue that robust motion estimation and filtering over temporally 
coherent patches along the motion path are essential for high-quality video de-noising. 


BM4D Filtering 


Block matching 4D (BM4D) extends the powerful collaborative filtering paradigm 
of BM3D for image de-noising (see Chapter 3) to video filtering by exploiting both 
non-local spatial and temporal self-similarity and sparsity in transform domain 
[Mag 12]. It is well known that the similarity of blocks along the motion trajectory is 
stronger than the non-local similarity existing within an individual frame even in the 
presence of fast motion. An earlier extension of BM3D to video de-noising, called 
V-BM3D, groups similar 2D blocks extracted from a set of consecutive frames into 
3D arrays regardless of whether they come from temporal similarity or the non-local 
spatial similarity. In contrast, V-BM4D groups mutually similar spatio-temporal 
volumes, a collection of 3D structures formed by a sequence of blocks of video 
following a specific trajectory, computed according to a non-local search procedure. 
Hence, groups in V-BM4D are 4D stacks of 3D volumes, and the collaborative 
filtering is then performed via a separable 4D spatio-temporal transform. V-BM4D 
provides state-of-the-art video de-noising. 


6.4 Multi-Frame Restoration


We discussed single frame (still) image restoration in Section 3.6. This section 
extends the formulation of the image-restoration problem to the case when we have 
a correlated sequence of blurred images, which are degraded by possibly different 
point-spread functions (PSF). The temporal dimension can be used both to estimate
the PSF and to improve the quality of restored images. For example, in linear-motion
blur, the extent of the spatial spread of a point source at a given pixel during
the aperture time can be computed from an estimate of the frame-to-frame velocity
(motion vector) at that pixel, provided that the shutter speed of the camera is known
[Tru 92]. Furthermore, if the blur PSF changes from frame to frame, the locations
of the zero-crossings of the blur frequency response vary from frame to frame, enabling
extraction of more information from a collection of frames than can be extracted
from any single frame.


6.4.1 Multi-Frame Modeling 


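Suppose we have $L$ frames of video, each blurred by possibly a different spatially
invariant PSF, $h_k[n_1,n_2]$, $k=1,\ldots,L$. The vector-matrix model can be extended to
multi-frame modeling over $L$ frames as

$$\mathbf{g} = \mathbf{D}\mathbf{s} + \mathbf{v} \tag{6.26}$$

where

$$\mathbf{g} = \begin{bmatrix} \mathbf{g}_1 \\ \vdots \\ \mathbf{g}_L \end{bmatrix}, \quad
\mathbf{s} = \begin{bmatrix} \mathbf{s}_1 \\ \vdots \\ \mathbf{s}_L \end{bmatrix}, \quad
\mathbf{v} = \begin{bmatrix} \mathbf{v}_1 \\ \vdots \\ \mathbf{v}_L \end{bmatrix}$$

are $N^2L\times 1$ vectors representing the observed, ideal, and noise frames, respectively,
stacked as multi-frame vectors, and

$$\mathbf{D} = \begin{bmatrix} \mathbf{D}_1 & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \mathbf{D}_L \end{bmatrix}$$

is an $N^2L\times N^2L$ matrix representing the multi-frame blur operator. Observe that
the multi-frame blur matrix $\mathbf{D}$ is block-diagonal, indicating no temporal blurring.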


6.4.2 Multi-Frame Wiener Restoration 


We employ a multi-frame Wiener deconvolution framework to exploit the temporal 
correlation between frames for spatially shift-invariant but temporally shift-varying 
blurs. That is, we have a shift-invariant 2D blur at each frame, but the blur PSF can 
change from frame-to-frame. Extension to the case of multi-frame spatially shift- 
varying restoration is discussed in Section 6.5.4 within the POCS framework. 

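Applying the CLS filter given by (3.107) to the observation equation (6.26) with
$\mathbf{L}^T\mathbf{L} = \mathbf{R}_s^{-1}\mathbf{R}_v$, we obtain the Wiener estimate $\hat{\mathbf{s}}$ of the $L$ frames $\mathbf{s}$ as

$$\hat{\mathbf{s}} = \left(\mathbf{D}^T\mathbf{D} + \mathbf{R}_s^{-1}\mathbf{R}_v\right)^{-1}\mathbf{D}^T\mathbf{g} \tag{6.27}$$

where

$$\hat{\mathbf{s}} = \begin{bmatrix} \hat{\mathbf{s}}_1 \\ \vdots \\ \hat{\mathbf{s}}_L \end{bmatrix}, \quad
\mathbf{R}_s = \begin{bmatrix} \mathbf{R}_{s;11} & \cdots & \mathbf{R}_{s;1L} \\ \vdots & \ddots & \vdots \\ \mathbf{R}_{s;L1} & \cdots & \mathbf{R}_{s;LL} \end{bmatrix}, \quad
\mathbf{R}_v = \begin{bmatrix} \mathbf{R}_{v;11} & \cdots & \mathbf{R}_{v;1L} \\ \vdots & \ddots & \vdots \\ \mathbf{R}_{v;L1} & \cdots & \mathbf{R}_{v;LL} \end{bmatrix}$$

in which $\mathbf{R}_{s;ij} = E\{\mathbf{s}_i\mathbf{s}_j^T\}$ and $\mathbf{R}_{v;ij} = E\{\mathbf{v}_i\mathbf{v}_j^T\}$, $i,j=1,2,\ldots,L$. Note that if $\mathbf{R}_{s;ij}=\mathbf{0}$
for $i\neq j$, $i,j=1,2,\ldots,L$, then the multi-frame estimate becomes equivalent to stacking
the $L$ single-frame estimates obtained individually.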

Again, direct computation of $\hat{\mathbf{s}}$ requires the inversion of the $N^2L\times N^2L$ matrix in
(6.27). Because the blur PSF is not necessarily the same in each frame, and the image
correlations are generally not shift-invariant in the temporal direction, the matrices
$\mathbf{D}$, $\mathbf{R}_s$, and $\mathbf{R}_v$ are not block-Toeplitz; thus, a 3D-DFT would not diagonalize them.
However, each $\mathbf{D}_k$ is block-Toeplitz. Furthermore, assuming each image and noise
frame is wide-sense stationary in the 2D plane, $\mathbf{R}_{s;ij}$ and $\mathbf{R}_{v;ij}$ are also block-Toeplitz.


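Approximating the block-Toeplitz sub-matrices $\mathbf{D}_k$, $\mathbf{R}_{s;ij}$, and $\mathbf{R}_{v;ij}$ by block-circulant
sub-matrices, each sub-matrix can be diagonalized by separate 2D-DFT operations
in an attempt to simplify matrix calculations. To this effect, we define the $N^2L\times N^2L$
transformation matrix

$$\mathcal{W}^{-1} = \begin{bmatrix} \mathbf{W}^{-1} & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & \mathbf{W}^{-1} \end{bmatrix}$$

where the $N^2\times N^2$ matrix $\mathbf{W}^{-1}$ is the 2D-DFT operator defined previously. Note
that the matrix operator $\mathcal{W}^{-1}$, when applied to $\mathbf{s}$, stacks the vectors formed by
2D-DFTs of the individual frames, but it is not the 3D-DFT operator.

We pre-multiply both sides of (6.27) with $\mathcal{W}^{-1}$ to obtain

$$\mathcal{W}^{-1}\hat{\mathbf{s}} = \left[\mathcal{W}^{-1}\mathbf{D}^T\mathcal{W}\,\mathcal{W}^{-1}\mathbf{D}\mathcal{W} + \left(\mathcal{W}^{-1}\mathbf{R}_s\mathcal{W}\right)^{-1}\mathcal{W}^{-1}\mathbf{R}_v\mathcal{W}\right]^{-1}\mathcal{W}^{-1}\mathbf{D}^T\mathcal{W}\,\mathcal{W}^{-1}\mathbf{g}$$

which can be expressed as

$$\hat{\mathbf{S}} = \mathbf{Q}^{-1}\mathbf{H}^{*}\mathbf{G} \tag{6.28}$$

where $\hat{\mathbf{S}}=\mathcal{W}^{-1}\hat{\mathbf{s}}$ and $\mathbf{G}=\mathcal{W}^{-1}\mathbf{g}$ are $N^2L$ vectors, and $\mathbf{P}=\mathcal{W}^{-1}\mathbf{R}_s^{-1}\mathbf{R}_v\mathcal{W}$,
$\mathbf{H}^{*}=\mathcal{W}^{-1}\mathbf{D}^T\mathcal{W}$, and $\mathbf{Q}=\mathbf{H}^{*}\mathbf{H}+\mathbf{P}$ are $N^2L\times N^2L$ block matrices with
$N^2\times N^2$ blocks, where each block is a diagonal matrix.

We present two approaches for the computation of $\mathbf{Q}^{-1}$: a general algorithm
called the cross-correlated multi-frame (CCMF) Wiener filter, which requires the
auto- and cross-power spectra of the frames, and a specific algorithm called the
motion-compensated multi-frame (MCMF) filter, which applies to the special case
of global, constant-velocity motion when a closed-form solution becomes possible.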


Cross-Correlated Multi-Frame Filter 


It was shown in [Ozk 92] that $\mathbf{Q}^{-1}$ is also a block matrix with diagonal blocks, and
the elements of $\mathbf{Q}^{-1}$ can be computed by inverting $N^2$ matrices, each $L\times L$. This
derivation will not be given here. The inversion of the $L\times L$ matrices can be performed
in parallel. Furthermore, if $L$ is sufficiently small, the $L\times L$ matrices can be inverted
using an analytic inversion formula.

The computation of the multi-frame Wiener estimate requires knowledge of the
covariance matrices $\mathbf{R}_s$ and $\mathbf{R}_v$. We assume the noise is spatio-temporally white; thus,
the matrix $\mathbf{R}_v$ is diagonal with all diagonal entries equal to $\sigma_v^2$, although the
formulation allows for any noise covariance matrix. The estimation of the multi-frame ideal
video-covariance matrix $\mathbf{R}_s$ can be performed by either the periodogram method or
3D-AR modeling [Ozk 92]. Once $\mathbf{Q}^{-1}$ is determined using the discussed scheme,
the Wiener estimate can be computed from (6.28).
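
The claim that (6.27) decouples into one small $L\times L$ system per frequency can be checked numerically on a toy problem. The sketch below is not the CCMF algorithm of [Ozk 92]; it uses 1D frames of $N$ samples instead of 2D frames, exactly circulant blurs, white noise, and an assumed separable covariance $\mathbf{R}_s = \mathbf{C}_t \otimes \mathbf{C}_x$ (temporal correlation matrix `Ct`, circulant spatial covariance with first column `cx`). All function names are illustrative.

```python
import numpy as np

def circulant(c):
    """N x N circulant matrix whose first column is c."""
    c = np.asarray(c, dtype=float)
    return np.stack([np.roll(c, j) for j in range(len(c))], axis=1)

def block_diag(blocks):
    """Block-diagonal matrix built from equally sized square blocks."""
    n = blocks[0].shape[0]
    out = np.zeros((n * len(blocks), n * len(blocks)))
    for k, b in enumerate(blocks):
        out[k * n:(k + 1) * n, k * n:(k + 1) * n] = b
    return out

def wiener_direct(h, Ct, cx, noise_var, g):
    """Brute-force solution of (6.27) with R_v = noise_var * I."""
    L, N = g.shape
    D = block_diag([circulant(hk) for hk in h])
    Rs = np.kron(Ct, circulant(cx))
    A = D.T @ D + noise_var * np.linalg.inv(Rs)
    return np.linalg.solve(A, D.T @ g.reshape(-1)).reshape(L, N)

def wiener_per_frequency(h, Ct, cx, noise_var, g):
    """Same estimate obtained by solving one L x L system per DFT frequency."""
    L, N = g.shape
    H = np.fft.fft(h, axis=1)          # frequency response of each frame's blur
    G = np.fft.fft(g, axis=1)          # DFT of each observed frame
    lam = np.fft.fft(cx).real          # spatial power spectrum (eigenvalues of C_x)
    S_hat = np.empty((L, N), dtype=complex)
    for f in range(N):
        Hf = np.diag(H[:, f])
        Pf = noise_var * np.linalg.inv(lam[f] * Ct)   # block of P at frequency f
        Qf = Hf.conj().T @ Hf + Pf                     # block of Q = H*H + P
        S_hat[:, f] = np.linalg.solve(Qf, Hf.conj().T @ G[:, f])
    return np.fft.ifft(S_hat, axis=1).real

def demo_inputs(L=3, N=8, seed=0):
    """Small random test problem (all values are arbitrary assumptions)."""
    rng = np.random.default_rng(seed)
    h = rng.random((L, N))
    h /= h.sum(axis=1, keepdims=True)              # a different PSF per frame
    idx = np.arange(L)
    Ct = 0.9 ** np.abs(idx[:, None] - idx[None, :])
    n = np.arange(N)
    cx = 0.8 ** n + 0.8 ** (N - n)                 # periodic, positive-definite
    g = rng.standard_normal((L, N))
    return h, Ct, cx, g
```

For example, `wiener_direct(*inputs)` inverts one $NL\times NL$ matrix, whereas `wiener_per_frequency` solves $N$ independent $L\times L$ systems, which is the structure that makes parallel or analytic inversion attractive.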


Motion-Compensated Multi-Frame Filter 


Assuming that the auto-power spectra of all frames are the same and there is global (cir- 
cular) motion, the cross spectra of the frames are related by a phase factor determined 
by the motion. Given the motion vectors (one for each frame) and the auto-power 
spectrum of the reference frame, the matrix $\mathbf{Q}^{-1}$ (hence, the Wiener estimate) can
be computed analytically using the Sherman-Morrison formula without an explicit
matrix inversion. The interested reader is referred to [Ozk 92] for further details. 
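
For reference, the Sherman-Morrison formula gives the inverse of a rank-one update of an invertible matrix. The symbols $\mathbf{M}$, $\mathbf{p}$, and $\mathbf{q}$ below are generic placeholders, not the book's notation:

$$\left(\mathbf{M} + \mathbf{p}\mathbf{q}^{H}\right)^{-1} = \mathbf{M}^{-1} - \frac{\mathbf{M}^{-1}\mathbf{p}\,\mathbf{q}^{H}\mathbf{M}^{-1}}{1 + \mathbf{q}^{H}\mathbf{M}^{-1}\mathbf{p}}, \qquad 1 + \mathbf{q}^{H}\mathbf{M}^{-1}\mathbf{p} \neq 0 .$$

A plausible reading of why it applies here: under the stated assumptions, the cross spectrum between frames $i$ and $j$ at a given frequency equals the common auto-power spectrum times a phase factor, so the $L\times L$ matrix of cross spectra at that frequency is a rank-one outer product. How exactly the terms are arranged to exploit this is given in [Ozk 92] and is not reproduced here.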

Other approaches for multi-frame image restoration include iterative non-linear 
optimization formulations with application to imaging through atmospheric tur- 
bulence (Section 3.8 of [Bov 00]) and reduced-order model Kalman filtering for 
progressive and interlaced video [Pat 98]. 


6.5 Multi-Frame Super-Resolution 


Super-resolution (SR) methods can be broadly classified as recognition-based or 
example-based single-frame methods vs. reconstruction-based multi-frame methods.
The former was briefly discussed in Chapter 3 in the context of nonlinear
image interpolation. This section presents multi-frame reconstruction-based SR
methods. We start by discussing how the reconstruction-based SR problem differs 
from interpolation and restoration problems and what makes SR possible in Sec- 
tion 6.5.1. Section 6.5.2 presents a model that relates the observed low-resolution 
(LR) frames with aliasing to a hypothetical higher-resolution reference image. The 
SR-reconstruction problem addresses recovery of this higher-resolution image from 
multiple LR frames with aliasing and sub-pixel motion. An early frequency-domain 
solution for a simple special case is described in Section 6.5.3. More recent spatio- 
temporal domain solutions to the general SR problem are presented in Section 
6.5.4. The reader is referred to [Par 03] for a general overview of SR-reconstruction 
methods. 


6.5.1 What Is Super-Resolution? 


Super-resolution (SR) is the process of reconstruction of spatial frequencies beyond
half of the Nyquist sampling rate that are clearly not available in a sampled image.
Assuming the frequency is normalized by the sampling rate, such that the highest
horizontal and vertical spatial frequencies in the input image are $\pi$, the input image
can be first up-sampled by a factor of $L$, shrinking the highest spatial frequencies
in the up-sampled image to $\pi/L$ (see Section 3.2). Now, the SR problem can be
defined as estimating those frequencies in the region $(\pi/L<\omega_1<\pi) \cup (\pi/L<\omega_2<\pi)$
for the up-sampled image. Assuming that the input image is $N_1\times N_2$, the output
image is $LN_1\times LN_2$, and there are $L^2-1$ unknown pixels for each input pixel.
Hence, the problem is highly ill-posed (under-determined), and there exist infinitely
many possible solutions in the absence of a strong image-formation model and/
or a priori information about the high-resolution (HR) image that constrains the
solution space.

SR methods can be classified as recognition/example-based vs. reconstruction- 
based methods. The former methods aim to learn or recognize the missing high- 
frequency patterns from a set of examples or a dictionary [Bak 02, Fre 02], which 
amounts to regularization of the problem by “learned” a priori models for the 
desired HR image. Although these methods yield images that are sharper than can 
be obtained by linear interpolation, they do not attempt to model either the aliasing 
or blurring present in the observed LR images. The reconstruction-based methods 
exploit the aliasing present in LR images, since no imaging sensor employs perfect 
anti-aliasing, by modeling LR image formation accounting for aliasing and sub-pixel 
motion between LR and HR grids (see “What makes super-resolution possible?” 
below).


Figure 6.21 Fourier spectrum of a digital signal up-sampled by a factor of 2:
(a) linear interpolation vs. (b) super-resolution.


Super-Resolution vs. Image Interpolation 


It is well-known that no new high-frequency information (i.e., not present in the 
input image) can be generated by linear shift-invariant interpolation including the 
ideal band-limited interpolation. Given a full-bandwidth (critically sampled) input 
image occupying the entire frequency range $-\pi<\omega<\pi$, we know from Section 3.2.2
that linear interpolation by a factor of $L$ contracts the output spectrum
to $-\pi/L<\omega<\pi/L$. This is illustrated in Figure 6.21(a), where the spectrum of a
linearly interpolated signal ($L=2$) satisfies $Y(e^{j\omega})=0$ for $\pi/2<|\omega|<\pi$. Note that
the frequency variable $\omega$ of the up-sampled signal is normalized with the higher
sampling rate, such that the frequency $\pi/L$ for the output corresponds to the frequency
$\pi$ for the input. SR methods aim to recover (reconstruct) the missing high-frequency
band as illustrated in Figure 6.21(b); hence, they reconstruct not only a larger image 
in size but also with higher resolution or definition (with higher frequency content). 
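
The statement that shift-invariant interpolation creates no new frequency content can be illustrated in 1D. The sketch below uses ideal band-limited interpolation implemented with the FFT on a periodic signal; the function names are illustrative, and an odd input length is assumed in the usage so that no DFT bin falls exactly on the cutoff $\pi/L$.

```python
import numpy as np

def zero_stuff(x, L):
    """Insert L-1 zeros between samples (up-sampling before filtering)."""
    z = np.zeros(len(x) * L)
    z[::L] = x
    return z

def upsample_bandlimited(x, L):
    """Ideal band-limited interpolation of a periodic signal by a factor L."""
    z = zero_stuff(x, L)
    Z = np.fft.fft(z)                               # L compressed copies of the input spectrum
    omega = 2 * np.pi * np.fft.fftfreq(len(z))      # normalized frequency in [-pi, pi)
    Z[np.abs(omega) >= np.pi / L] = 0.0             # ideal low-pass with cutoff pi/L
    return L * np.fft.ifft(Z).real
```

The spectrum of the zero-stuffed signal is just the input spectrum repeated $L$ times on the compressed frequency axis, and the low-pass filter keeps one copy. The output therefore reproduces the input samples and is empty in the band $\pi/L<|\omega|<\pi$, which is the band an SR method tries to fill.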


Super-Resolution vs. Image Restoration 


Image restoration does not involve up-sampling an image; hence, no new frequency 
band is created. Image-restoration filters correct the Fourier magnitude/phase of 
optically distorted images only within the original input image-frequency range 
given the optical-transfer function of the distortion (blur). 


What Makes Super-Resolution Possible? 


SR from multiple LR frames is a better-posed problem than single-frame SR since
each LR frame with sub-pixel motion potentially contains novel information about
the desired HR image, provided that we can estimate sub-pixel motion between
multiple LR frames accurately. In addition, the presence of aliasing in LR images
is essential for recovery of missing high-frequency information (see Section 6.2.1).
Indeed, aliasing is a representation of the missing high-frequency information folded
over existing frequencies; i.e., the high-frequency content of the desired HR image
is embedded in the available (low) frequency band of each LR image. Hence, it is a
combination of i) the presence of aliasing in LR images, and ii) sub-pixel displacements
between multiple images (frames) that makes super-resolution possible.


Figure 6.22 Sub-pixel displacement between frames allows higher-resolution reconstruction.

Four sub-pixel-shifted LR frames are illustrated in Figure 6.22, where large 
squares denote low-resolution pixels, whose intensity is proportional to the average 
brightness within each square. Each image is shifted by a half-pixel to the right and/ 
or half-pixel up in the LR image coordinates and, hence, contains new observations. 
Note that if they were shifted by an integer pixel in both directions, then all four 
images would consist of exactly the same pixels (just displaced); hence, multiple 
frames would contain no new information. When L=2, we estimate four smaller 
(HR) pixels for each LR pixel. The estimated HR image should be consistent with all 
four LR images according to the image capture model.


Limits of Super-Resolution


How fast the SR reconstruction deteriorates as the magnification factor increases 
and whether there is a fundamental limit on the magnification factor that can be 
achieved have been analyzed based on the perturbation theory of linear systems of equations
and the condition number of the coefficient matrix [Bak 02, Lin 04]. However,
it is worth noting that these studies do not consider i) the amount of aliasing in the 
LR images, ii) the accuracy of registration, and iii) the SNR of the input LR images 
explicitly in their analysis. 
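
The role of sub-pixel shifts, and the way the coefficient matrix limits what can be recovered, can be illustrated with a small 1D linear system. In this sketch (an assumed toy model, not taken from the cited analyses) each LR sample is the average of `R` consecutive HR samples on a periodic HR signal of length `M`, and a frame's shift is given in HR samples, so a shift of `R` equals one full LR pixel.

```python
import numpy as np

def lr_rows(M, R, shift):
    """Observation matrix of one LR frame: each LR sample averages R
    consecutive HR samples, starting at HR offset `shift` (circular)."""
    N = M // R
    A = np.zeros((N, M))
    for n in range(N):
        for r in range(R):
            A[n, (n * R + shift + r) % M] = 1.0 / R
    return A

def stacked_rank(M, R, shifts):
    """Rank of the system obtained by stacking the frames with the given shifts."""
    A = np.vstack([lr_rows(M, R, s) for s in shifts])
    return int(np.linalg.matrix_rank(A))
```

With `M=8` and `R=2`, a single frame gives rank 4. Adding a frame shifted by a whole LR pixel (`shifts=[0, 2]`) leaves the rank at 4, since it contains the same samples displaced. Adding a frame shifted by half an LR pixel (`shifts=[0, 1]`) raises the rank to 7. It does not reach 8 because the two-sample box average has a spectral zero at the highest HR frequency, a small instance of the conditioning problems discussed above.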


6.5.2 Modeling Low-Resolution Sampling 


Most consumer video cameras can record frames at a resolution lower than desirable. 
This is related to some physical limitations, such as finite-sensor-cell area and finite 
aperture time. Although high-resolution professional cameras exist, these may be too 
expensive and/or unsuitable for mobile-imaging applications. In the following, we 
first present a model that relates observed low-resolution (LR) sampled video frames 
with aliasing to the underlying continuous video. Next, the LR frames are expressed 
in terms of a hypothetical high-resolution (HR) reference video frame. The super- 
resolution problem is posed as a reconstruction of this HR frame from a number of 
observed LR frames that are sub-pixel registered. 


Continuous-Discrete Model 


A comprehensive model of LR video acquisition should include the effects of finite- 
sensor-cell area and finite-aperture time. LR images suffer from a combination of 
blurring due to spatial integration at the sensor surface (due to finite cell area), 
modeled by a shift-invariant spatial PSF $h_a(x_1,x_2)$, and aliasing due to sub-Nyquist
sampling. In addition, relative motion between the scene and the camera during 
the aperture time gives rise (due to temporal integration) to shift-varying spatial 
blurring. The relative motion may be due to camera motion, as in camera pan or a 
camera mounted on a moving vehicle, or the motion of objects in a scene, which is 
especially significant in high-action movies and sports videos. Recall that it is often 
this motion that gives rise to temporal variation of the spatial-intensity distribution. 
Images may also suffer from out-of-focus blur, which is modeled by a shift-invariant
spatial PSF $h_o(x_1,x_2)$ (ignoring the depth-of-field), and from additive noise.

Figure 6.23 Block diagram of video acquisition.

A block diagram of the continuous-input, discrete-output model of a low-resolution
video-acquisition process is depicted in Figure 6.23. The first sub-system
models temporal integration during the aperture time $T_a$, resulting in a blurred
video (due to relative motion) given by

$$s_m(x_1,x_2,t) = \frac{1}{T_a}\int_0^{T_a} s_c(x_1,x_2,t-\eta)\,d\eta \tag{6.29a}$$


where $s_c(x_1,x_2,t)$ denotes the ideal continuous video that would be formed by an
instantaneous aperture. The model (6.29a) can be interpreted by considering a stationary
camera and mapping the effect of any camera motion to a relative scene
motion. The next sub-system models the combined effects of integration at the sensor
surface and any shift-invariant out-of-focus blur, resulting in a further blurred
video given by

$$g(x_1,x_2,t) = h_b(x_1,x_2) \ast\ast\, s_m(x_1,x_2,t) \tag{6.29b}$$

where $h_b(x_1,x_2)=h_o(x_1,x_2)\ast\ast\, h_a(x_1,x_2)$ is the combined shift-invariant spatial PSF and
$\ast\ast$ denotes 2D convolution. The integration in the model over time (6.29a) can
equivalently be written as a spatial integration at a single time instant $\tau$ using our
ideal video source model


$$s_c(x_1,x_2,t) = s_c\big(c_1(\tau;x_1,x_2,t),\ c_2(\tau;x_1,x_2,t),\ \tau\big) \tag{6.30}$$

where $c(\tau;x_1,x_2,t) = (c_1(\tau;x_1,x_2,t),\ c_2(\tau;x_1,x_2,t))$ is the motion trajectory function
that is defined in Section 6.1.1. Then, assuming the reference time $\tau$ is within the
temporal span of all motion trajectories passing through $(x_1,x_2,t-\eta)$, $0\le\eta\le T_a$,
all $(x_1,x_2)$ within the time interval $(t-T_a,\,t)$ can be traced onto the plane at time $\tau$
using the source model (6.30) to perform spatial integration over these points. The
corresponding spatial PSF is given by


where $J(u_1,u_2,t)$ is the Jacobian of the change of variables and $c_1^{-1}(\tau;x_1,x_2,u_1)$
returns the time $t-T_a\le t_0\le t$ when $u_1=c_1(\tau;x_1,x_2,t_0)$. (See [Pat 97b] for a complete
derivation of this PSF.) The spatial support of $h_m(u_1,u_2,\tau;x_1,x_2,t)$, which is a
shift-variant PSF, is a path at the reference time $\tau$, which is depicted in Figure 6.24.

Figure 6.24 Representation of the effect of motion during the aperture time
(temporal integration) by a spatial PSF (spatial integration).

The path, mathematically expressed in Eqn. (6.31) by the 1D delta function
within the interval $[c_1(\tau;x_1,x_2,t-T_a),\ c_1(\tau;x_1,x_2,t)]$, is obtained by mapping all
points $(x_1,x_2,t-\eta)$, $0\le\eta\le T_a$ (shown by the solid vertical line) onto time $\tau$. Therefore,
combining the video acquisition model in Figure 6.23 with the source model
(6.30), we can establish the relationship

$$g(x_1,x_2,t) = \iint s_c(u_1,u_2,\tau)\,h(u_1,u_2,\tau;x_1,x_2,t)\,du_1\,du_2 \tag{6.32}$$

between the observed intensity $g(x_1,x_2,t)$ and the ideal intensities $s_c(u_1,u_2,\tau)$ along
the path $(u_1,u_2)=c(\tau;x_1,x_2,t)$ at time $\tau$, where

$$h(u_1,u_2,\tau;x_1,x_2,t) = h_b(x_1,x_2) \ast\ast\, h_m(u_1,u_2,\tau;x_1,x_2,t)$$


denotes the effective shift-variant spatial PSF of the composite model.

The degraded video is then ideally sampled on a 3D low-resolution lattice $\Lambda_l$,
and noise is added to form the observed discrete video

$$g_k(n_1,n_2) = g(x_1,x_2,t)\Big|_{[x_1\ x_2\ t]^T = \mathbf{V}_l\,[n_1\ n_2\ k]^T} + v_k(n_1,n_2) \tag{6.33}$$

where $\mathbf{V}_l$ is the sampling matrix of the low-resolution lattice $\Lambda_l$. If we assume that
$g(x_1,x_2,t)$ is bandlimited, but the sampling intervals in the $x_1$, $x_2$, and $t$ directions
are larger than the corresponding Nyquist intervals, aliasing may result. It is essential
that no anti-alias filtering be used prior to sampling in order to achieve super-resolution,
which follows from the discussion in Section 6.2.1.
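
The equivalence between temporal integration during the aperture time, as in (6.29a), and a spatial PSF supported on the motion path, as in (6.32), can be checked on a discrete toy case. The sketch assumes a 1D periodic signal translating with a constant integer velocity `v` (samples per time step), an aperture of `T` time steps, and the reference time $\tau = t$; the integral is replaced by a sum. All names are illustrative.

```python
import numpy as np

def frame_at(s0, v, t):
    """Ideal frame at time t for global translation: s(x, t) = s0(x - v t)."""
    return np.roll(s0, v * t)

def temporal_integration(s0, v, t, T):
    """Discrete counterpart of (6.29a): average of the frames at t, t-1, ..., t-T+1."""
    return np.mean([frame_at(s0, v, t - eta) for eta in range(T)], axis=0)

def motion_psf(N, v, T):
    """Spatial PSF at reference time t: T equal taps placed along the motion path."""
    h = np.zeros(N)
    for eta in range(T):
        h[(-v * eta) % N] += 1.0 / T
    return h

def spatial_integration(s0, v, t, T):
    """Blur the single reference frame at time t with the motion PSF
    (circular convolution via the FFT)."""
    ref = frame_at(s0, v, t)
    h = motion_psf(len(s0), v, T)
    return np.fft.ifft(np.fft.fft(ref) * np.fft.fft(h)).real
```

Both routes give the same blurred frame. For constant global velocity the PSF is the same at every pixel; with spatially varying motion the path, and hence the PSF, changes from pixel to pixel, which is why the general model is shift-variant.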


Discrete-Discrete Model 


Next, we relate a set of low-resolution observations $g_k(n_1,n_2)$ to the desired reference
high-resolution frame to be reconstructed, which is defined as

$$s(m_1,m_2) = s_c(x_1,x_2,t)\Big|_{[x_1\ x_2\ t]^T = \mathbf{V}_h\,[m_1\ m_2\ i]^T} \tag{6.34}$$

where $\mathbf{V}_h$ is the sampling matrix of the high-resolution sampling grid. Let’s assume
that the high-resolution frame $s(m_1,m_2)$ is sampled above the Nyquist rate, so that
the continuous intensity pattern is more or less constant within each high-resolution
pixel (cells depicted in Figure 6.25). Then, given the motion trajectory $c(\tau;x_1,x_2,t)$
passing through $(n_1,n_2,k)$, we have, from (6.33) and (6.32),

$$g_k(n_1,n_2) \approx \sum_{m_1}\sum_{m_2} s(m_1,m_2) \iint h(u_1,u_2,\tau;x_1,x_2,t)\,du_1\,du_2 + v_k(n_1,n_2)$$

where $[x_1\ x_2\ t]^T=\mathbf{V}_l\,[n_1\ n_2\ k]^T$, $[u_1\ u_2\ \tau]^T=\mathbf{V}_h\,[m_1\ m_2\ i]^T$, and the integration
is over the cell of the high-resolution pixel $(m_1,m_2)$;
i.e., $\tau$ is the time coordinate of the reference frame and $(m_1,m_2)$ denotes the spatial
coordinates of the pixel in the high-resolution reference frame that matches
$(n_1,n_2,k)$. Next, we define

$$h_k(m_1,m_2;n_1,n_2) = \iint h(u_1,u_2,\tau;x_1,x_2,t)\,du_1\,du_2$$

to arrive at our discrete-input (high-resolution video), discrete-output (observed
low-resolution video) model, given by

$$g_k(n_1,n_2) = \sum_{m_1}\sum_{m_2} s(m_1,m_2)\,h_k(m_1,m_2;n_1,n_2) + v_k(n_1,n_2) \tag{6.35}$$

where the support of the summation over the high-resolution grid $(m_1,m_2)$ at a particular
low-resolution sample $(n_1,n_2,k)$ is depicted in Figure 6.25.

Figure 6.25 Illustration of the discrete system PSF.


The size of the support in Figure 6.25 depends on the relative velocity of the scene
with respect to the camera, the size of the support of the low-resolution sensor PSF
$h_a(x_1,x_2)$ (depicted by the solid line, assuming no out-of-focus blur) with respect
to the high-resolution grid, and whether there is any out-of-focus blur. Because the
relative positions of low- and high-resolution pixels in general vary from pixel to
pixel, the discrete sensor PSF is space-varying.

The model (6.35) establishes a relationship between the desired high-resolution
frame and the observed low-resolution pixels from all frames $k$ that are connected
to the desired frame by means of a motion trajectory. That is, each low-resolution
observed pixel $(n_1,n_2,k)$ can be expressed as a linear combination of several
high-resolution pixels from the desired frame provided that $(n_1,n_2,k)$ is connected to
the desired frame by a motion trajectory. We assume that occlusion regions can be
detected and pixels for which the model is invalid are excluded.
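
A heavily simplified instance of the observation model (6.35) can be written as a forward simulator. The sketch below assumes global translation by an integer number of HR pixels (circular at the borders), a sensor PSF that averages an `R x R` block of HR pixels, no out-of-focus blur, and optional white Gaussian noise. The function name and parameters are illustrative only.

```python
import numpy as np

def simulate_lr_frame(hr, shift, R, noise_std=0.0, rng=None):
    """Generate one LR frame from the HR reference frame.

    hr        : 2D HR reference frame with side lengths divisible by R
    shift     : (dy, dx) global displacement in HR pixels (sub-pixel on the LR grid
                whenever it is not a multiple of R)
    R         : down-sampling factor per direction
    noise_std : standard deviation of the additive noise v_k
    """
    moved = np.roll(hr, shift, axis=(0, 1))
    M1, M2 = moved.shape
    # Each LR pixel is a weighted sum of HR pixels with equal weights 1/R^2.
    lr = moved.reshape(M1 // R, R, M2 // R, R).mean(axis=(1, 3))
    if noise_std > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        lr = lr + noise_std * rng.standard_normal(lr.shape)
    return lr
```

Each LR pixel produced this way is a linear combination of $R^2$ HR pixels, which is the structure expressed by $h_k(m_1,m_2;n_1,n_2)$. In the general model the weights also depend on the local motion and blur and therefore vary from pixel to pixel.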


Problem Interrelations 


The super-resolution problem stated by (6.35) is a superset of other filtering prob- 
lems that are discussed in this chapter as follows: 


1. Noise filtering: Input and output lattices are identical, $\Lambda_l=\Lambda_h$; sensor PSF is not
taken into account, $h_a(x_1,x_2)=\delta(x_1,x_2)$; the camera aperture time is negligible,
$T_a=0$; and there is no optical blur, $h_o(x_1,x_2)=\delta(x_1,x_2)$.

2. Multi-frame/image restoration: Input and output lattices are identical, $\Lambda_l=\Lambda_h$;
sensor PSF is not taken into account, i.e., we assume $h_a(x_1,x_2)=\delta(x_1,x_2)$.

3. Standards conversion: Input and output lattices are different, $\Lambda_l\neq\Lambda_h$, but sensor
PSF is not taken into account, $h_a(x_1,x_2)=\delta(x_1,x_2)$; the camera aperture time
is negligible, $T_a=0$; there is no optical blur, $h_o(x_1,x_2)=\delta(x_1,x_2)$; and there is no
noise.

We first present a frequency-domain solution to a special case (global translation)
in Section 6.5.3, which helps the reader to understand why aliasing should
be present in the LR images for SR reconstruction. We address the general
super-resolution problem formulated by Eqn. (6.35) in Section 6.5.4 using
spatial-domain methods.


6.5.3 Super-Resolution in the Frequency Domain 


The frequency-domain method, first proposed by Tsai and Huang [Tsa 84], 
exploits the relationship between the continuous and discrete-Fourier transforms 
of the under-sampled frames in the special case of global sub-pixel shifts. If we let
$s_0(x_1,x_2)=s_c(x_1,x_2,0)$ denote the reference frame, and assume zero aperture time
($T_a=0$) and rectangular sampling with the sampling interval $\Delta$ in both directions,
then the continuous-input, discrete-output model given by Eqn. (6.33) simplifies as

$$g_k(n_1,n_2) = \sum_{n_1=0}^{N-1}\sum_{n_2=0}^{N-1} \left[ s_0(x_1-\alpha_k,\,x_2-\beta_k) \ast\ast\, h_a(x_1,x_2) \right] \delta(x_1-n_1\Delta,\ x_2-n_2\Delta) + v_k(n_1,n_2) \tag{6.36}$$

where $\alpha_k$ and $\beta_k$ denote the $x_1$ and $x_2$ components of the displacement of frame $k$
with respect to the reference frame, $\delta(x_1,x_2)$ denotes the 2D Dirac delta function,
and the low-resolution frames are assumed to be $N\times N$. Because of the linearity of
convolution, Eqn. (6.36) can be equivalently expressed as

$$g_k(n_1,n_2) = \sum_{n_1=0}^{N-1}\sum_{n_2=0}^{N-1} \left[ s_0(x_1,x_2) \ast\ast\, h_a(x_1-\alpha_k,\,x_2-\beta_k) \right] \delta(x_1-n_1\Delta,\ x_2-n_2\Delta) + v_k(n_1,n_2) \tag{6.37}$$

Suppose we wish to reconstruct an $M\times M$ high-resolution sampled version of
the reference frame $s_0(x_1,x_2)$, where $M$ is an integer multiple of $N$, i.e., $R=M/N$ is an
integer. Assuming that $s_0(x_1,x_2)$ is bandlimited, such that

$$|S_0(F_1,F_2)| = 0 \quad \text{for} \quad |F_1|,\,|F_2| \ge R\,\frac{1}{2\Delta}$$

the model (6.37) under-samples $s_0(x_1,x_2)$ by a factor $R$ in both $x_1$ and $x_2$ directions.
Taking the Fourier transform of both sides of the model (6.37), we obtain

$$G_k(f_1,f_2) = \frac{1}{\Delta^2}\sum_{i_1=0}^{R-1}\sum_{i_2=0}^{R-1} S_0\!\left(\frac{f_1-i_1}{\Delta},\frac{f_2-i_2}{\Delta}\right) H_a\!\left(\frac{f_1-i_1}{\Delta},\frac{f_2-i_2}{\Delta}\right) \exp\!\left\{-j2\pi\left(\frac{f_1-i_1}{\Delta}\,\alpha_k + \frac{f_2-i_2}{\Delta}\,\beta_k\right)\right\} + V_k(f_1,f_2) \tag{6.38}$$

where $S_0\!\left(\frac{f_1}{\Delta},\frac{f_2}{\Delta}\right)$ and $H_a\!\left(\frac{f_1}{\Delta},\frac{f_2}{\Delta}\right)$ denote the 2D continuous Fourier transform of
the reference frame $s_0(x_1,x_2)$ and the sensor PSF $h_a(x_1,x_2)$, respectively, and $G_k(f_1,f_2)$
and $V_k(f_1,f_2)$ denote the 2D discrete-Fourier transform of $g_k(n_1,n_2)$ and $v_k(n_1,n_2)$,
respectively. We note that motion blur can be incorporated into the simplified model
(6.38) (i.e., the assumption $T_a=0$ may be relaxed) only if we have global,
constant-velocity motion, so that the frequency response of the motion blur remains the same
from frame to frame.

In order to recover the unaliased spectrum $S_0\!\left(\frac{f_1}{\Delta},\frac{f_2}{\Delta}\right)$ at a given frequency pair
$(f_1,f_2)$, we need at least $R^2$ equations (6.38) at that frequency pair, which could be
obtained from $L\ge R^2$ low-resolution frames. The set of equations (6.38), $k=1,\ldots,L$,
at any frequency pair $(f_1,f_2)$ are decoupled from the equations that are formed at
any other frequency pair. The formulation of Tsai and Huang [Tsa 84] ignores sensor
blur and noise and proposes to set up $L\ge R^2$ equations in as many unknowns at
each frequency pair to recover the alias-free spectrum $S_0\!\left(\frac{f_1}{\Delta},\frac{f_2}{\Delta}\right)$. This procedure is
illustrated by the following example. If we take the sensor blur into account, we can
first recover the product $S_0\!\left(\frac{f_1}{\Delta},\frac{f_2}{\Delta}\right) H_a\!\left(\frac{f_1}{\Delta},\frac{f_2}{\Delta}\right)$. Subsequently, $S_0\!\left(\frac{f_1}{\Delta},\frac{f_2}{\Delta}\right)$ can be
estimated by inverse filtering or any other regularized de-convolution technique.


Example: Frequency-Domain Super-Resolution: 1D Case 


Consider two low-resolution observations of a 1D signal ($L=2$), which are
shifted with respect to each other by a sub-sample amount $\alpha$, such that

$$g_1(n) = \sum_{n=0}^{N-1} s_0(x)\,\delta(x-n\Delta) + v_1(n)$$
$$g_2(n) = \sum_{n=0}^{N-1} s_0(x-\alpha)\,\delta(x-n\Delta) + v_2(n)$$

where $N$ is the number of low-resolution samples and the sensor PSF is
assumed to be a delta function.

Assuming that $g_k(n)$, $k=1,2$, are both sampled at one-half of the Nyquist
rate ($R=2$), and taking the Fourier transform of both sides, we have

$$G_1(f) = \frac{1}{\Delta}\left[ S_0\!\left(\frac{f}{\Delta}\right) + S_0\!\left(\frac{f-1}{\Delta}\right) \right] + V_1(f)$$
$$G_2(f) = \frac{1}{\Delta}\left[ S_0\!\left(\frac{f}{\Delta}\right) e^{-j2\pi\frac{f}{\Delta}\alpha} + S_0\!\left(\frac{f-1}{\Delta}\right) e^{-j2\pi\frac{f-1}{\Delta}\alpha} \right] + V_2(f)$$

There is only one aliasing term in each expression, because we assumed sampling
at half the Nyquist rate. Neglecting the noise terms, we can solve for the two
unknowns $S_0(f/\Delta)$ and $S_0((f-1)/\Delta)$ from these two equations given $G_1(f)$ and
$G_2(f)$. Repeating this at each frequency sample, the spectrum of the alias-free HR signal
can be recovered.
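
A discrete, noise-free version of this 1D example can be run directly. The sketch below is a DFT-domain analogue rather than the continuous-transform derivation: the HR signal is periodic with $M=2N$ samples, the LR sample spacing is taken as $\Delta=1$, the sensor PSF is a delta, and the factor $1/\Delta$ is replaced by $1/R=1/2$. The HR signal is assumed to have no energy in its Nyquist bin so that a fractional shift is well defined. All function names are illustrative.

```python
import numpy as np

def shift_periodic(s_hr, delay_hr):
    """Delay a periodic signal by a (possibly fractional) number of HR samples."""
    M = len(s_hr)
    k = np.fft.fftfreq(M) * M                     # signed frequency indices
    phase = np.exp(-2j * np.pi * k * delay_hr / M)
    return np.fft.ifft(np.fft.fft(s_hr) * phase).real

def observe(s_hr, alpha):
    """Two LR frames sampled at every second HR sample; frame 2 is delayed
    by alpha LR samples (0 < alpha < 1)."""
    g1 = s_hr[::2]
    g2 = shift_periodic(s_hr, 2.0 * alpha)[::2]
    return g1, g2

def tsai_huang_1d(g1, g2, alpha):
    """Recover the HR signal by solving a 2 x 2 system at each LR frequency."""
    N = len(g1)
    G1, G2 = np.fft.fft(g1), np.fft.fft(g2)
    S = np.zeros(2 * N, dtype=complex)
    for q in range(N):
        # Unknowns: the baseband term S[q] and its alias S[q - N] (stored at q + N).
        A = 0.5 * np.array([
            [1.0, 1.0],
            [np.exp(-2j * np.pi * q * alpha / N),
             np.exp(-2j * np.pi * (q - N) * alpha / N)],
        ])
        S[q], S[q + N] = np.linalg.solve(A, np.array([G1[q], G2[q]]))
    return np.fft.ifft(S).real
```

The $2\times 2$ matrix is invertible for any non-integer `alpha`; for an integer shift its two rows differ only by a common phase factor and the system becomes singular, in line with the observation that integer-pixel shifts carry no new information.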


The following remarks about super-resolution in the frequency domain are in
order:

1. The frequencies in the range $0\le f_1,f_2<1$ are discretized by $K$ samples along each
direction, where $K\ge M$, and $M$ is the number of high-resolution signal samples.
Then, samples of the high-resolution frame can be computed by a $K\times K$ inverse
2D-DFT.

2. It is easy to see that if the image $s_c(x_1,x_2,0)$ were passed through a perfect
anti-alias filter before the low-resolution sampling, there would be only one term in
the double summation in Eqn. (6.38), and recovery of a high-resolution image
would not be possible no matter how many low-resolution frames were available.
Thus, it is the aliasing terms that make the recovery of a high-resolution
image possible.


The frequency-domain approach has some drawbacks:

1. The set of equations (6.38) at each frequency pair $(f_1, f_2)$ can best be solved in
the least-squares sense due to the presence of observation noise. Thus, more than
$L^2$ equations, and hence more frames, are needed to solve for the $L^2$ unknowns
at each frequency pair $(f_1, f_2)$.

2. The set of equations (6.38) may be singular depending on the relative position
of the sub-pixel displacements, $\alpha_k$ and $\beta_k$, or due to zero-crossings in the
Fourier transform $H(f_1, f_2)$ of the sensor PSF. In particular, if there are more
than $L$ frames with shifts $(\alpha_k, \beta_k)$ on a line parallel to either the $x_1$ or $x_2$ axis,
or if there are more than $L(L^2 - L - 1)/2$ pairs of frames with shifts $(\alpha_k, \beta_k)$ that
are symmetric with respect to the line $x_1 = x_2$, the system of equations becomes
singular. This fact is stated by a theorem in [Kim 90]. Furthermore, it is clear
from Eqn. (6.38) that any zeros in $H(f_1, f_2)$ result in a column of zeros in the
coefficient matrix of the system of equations. Then, regularization techniques
need to be employed that limit the resolution improvement.

3. This approach has been extended by Kim et al. [Kim 90] to take noise and blur
in the low-resolution images into account, where blur and noise characteristics
need to be the same for all frames of the low-resolution data, and impulse sampling
has been assumed for the low-resolution images (i.e., the low-resolution
sensor has no physical size). This method was further refined by Kim and Su
[Kim 93] to take into account blurs that are different for each frame of low-resolution
data by using a Tikhonov regularization. The resulting algorithm
does not treat the formation of blur due to motion or sensor size, and may suffer
from convergence problems.


6.5.4 Multi-Frame Spatial-Domain Methods 


The general SR-reconstruction problem formulated by the space-varying image 
formation model (6.35) can be addressed by spatial-domain methods that can 
be classified as early two-step interpolation-restoration methods, regularized SR- 
reconstruction methods, including Bayesian and set-theoretic methods, and methods 
that do not require sub-pixel motion estimation. They are reviewed in the following. 

An important application of SR reconstruction is creating a single high-quality 
still image, a snapshot, from a video clip. Video snapshots [Sun 12] is a recent system 
that combines SR, de-noising, and deblurring with importance (saliency) weighting 
to produce either a snapshot that suppresses independently moving objects, or a 
snapshot that summarizes the motion of salient objects in a single frame. 


Interpolation-Restoration Methods 


Early work on SR was based on two-stage interpolation-restoration methods, which 
essentially first map all pixels from available low-resolution frames onto a single up- 
sampled reference frame by using image-registration techniques. However, unless we 
assume global, constant-velocity motion, the up-sampled reference frame contains 
non-uniformly spaced samples. In order to obtain a uniformly spaced up-sampled 
image, an interpolation onto a uniform sampling grid needs to be performed. Then,
a post-processing step is applied, in which image restoration removes the effect of
the sensor PSF blur from the up-sampled image. Sauer and Allebach [Sau 87]
propose an iterative method to reconstruct band-limited images from non-uniformly
spaced samples. Their method estimates image intensities on a uniformly spaced
grid using a projections-based method. Ur and Gross [Ur 92] have proposed a 
non-uniform interpolation scheme based on the generalized sampling theorem of 
Papoulis to obtain an improved-resolution blurred image, which is then restored 
using an image-restoration step. An important drawback of these algorithms is that 
they do not address removal of aliasing artifacts. Furthermore, the restoration stage 
neglects errors in the interpolation stage. 


Regularized-Reconstruction Methods 


Super-resolution is an inverse problem. The forward process of modeling LR-image 
formation results in a set of simultaneous linear equations given by Eqn. (6.35). 
Hence, the solution of the set of simultaneous linear equations given by (6.35) is an 
inverse problem. Suppose that the desired high-resolution (HR) frame is $M \times M$, and
we have $L$ LR frames, each $N \times N$. Then, we can set up $L \times N \times N$ equations in $M^2$
unknowns to reconstruct the HR frame. These equations are linearly independent
provided that all displacements between LR frames are sub-pixel. Clearly, the number
of equations will be reduced by the number of occluded pixels encountered along
the motion trajectories. In general, it is desirable to set up an overdetermined system
of equations, i.e., the number of LR frames $L > R^2 = M^2/N^2$, to obtain a robust
solution. We note that fast methods to solve the set of simultaneous equations (6.35) are
not available, because the impulse-response coefficients $h_k(m_1, m_2, n_1, n_2)$ are spatially
varying in general; hence, the system matrix is not block-Toeplitz.
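The equation count above is simple arithmetic, and a short sketch makes the overdetermined condition concrete. The helper names `system_size` and `min_lr_frames` are hypothetical, and occlusions (which remove equations) are ignored here.

```python
def system_size(M, N, L):
    """Return (number of equations, number of unknowns) for L LR frames
    of size N x N and an HR frame of size M x M, ignoring occlusions."""
    return L * N * N, M * M

def min_lr_frames(M, N):
    """Smallest integer L that satisfies L > M^2 / N^2."""
    return (M * M) // (N * N) + 1

equations, unknowns = system_size(M=512, N=256, L=5)
frames_needed = min_lr_frames(M=512, N=256)   # resolution factor R = 2
```

For a factor-of-two increase in each dimension, four frames give exactly as many equations as unknowns, so a fifth frame is the first that makes the system overdetermined.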

The inverse problem is ill-posed, because the presence of any noise in LR images 
and small errors in sub-pixel motion estimates leads to large errors in the solution. 
To this effect, regularization of the solution by incorporation of a priori image and 
noise models (similar to those used for image de-noising and restoration) is required 
to obtain a stable solution in the presence of noise. Hence, the problem is formulated 
as optimization of some regularization cost function subject to the constraint that 
the estimated high-resolution image is consistent with all observed low-resolution 
images and a priori models, which is solved iteratively. 

All regularization methods that were discussed for solving the image-restoration 
problem (see Section 3.6) are also applicable to solving the SR-reconstruction prob- 
lem. The regularization methods for solving the SR-reconstruction problem include 
i) constrained least squares [Par 03], ii) iterative back-projection [Ira 91, Ira 93], 
iii) set-theoretic methods including the projection onto convex sets (POCS) method 
[Tek 92, Pat 97b], iv) Bayesian methods including maximum likelihood (ML) and 
maximum a posteriori probability (MAP) [Sch 96], v) hybrid methods combining 
MAP and POCS [Ela 97], and vi) methods using self-similarity and sparsity models
[Tak 09]. We note that Komatsu et al. [Kom 93] have made the observation that
using multiple cameras with different pixel apertures can overcome some limitations 
in the camera/motion configurations that may be encountered when using cameras 
with the same pixel aperture. 


Bayesian Super-Resolution 


The MAP-estimation framework is a popular Bayesian approach to solve ill-posed 
image restoration and super-resolution reconstruction problems iteratively as a non- 
linear optimization problem. In this framework, the image-formation model is stated 
in a stochastic form by representing the model error 


$$v_k(n_1, n_2) = g_k(n_1, n_2) - \sum_{m_1}\sum_{m_2} s(m_1, m_2)\, h_k(m_1, m_2, n_1, n_2)$$

as Gaussian observation noise. Schultz and Stevenson [Sch 96] employed
discontinuity-preserving Huber-Markov Gibbs priors as the a priori image model.
They formulated a constrained optimization problem with a unique minimum that 
can be found by iterative methods. More details can be found in [Sch 96, Par 03]. 

Excellent results have been reported for sequences with global frame-to-frame 
motion, such as camera pan or other global warping. More modest resolution 
improvements were observed for scenes with independent object motions [Sch 96]. 


Set-Theoretic Methods 


Set-theoretic methods address the general super-resolution problem in the POCS 
framework, including the special cases of multi-frame shift-varying restoration. In 
the following, we present a POCS-based method to solve a set of simultaneous linear 
equations (6.35) [Tek 92, Pat 97b]. A similar but more limited solution was also 
proposed by Stark and Oskoui [Sta 89]. The POCS formulation presented here is 
similar to that in Section 3.6.4 for intra-frame restoration of shift-varying blurred 
images. 

We define a different closed, convex set for each observed low-resolution pixel
$(n_1, n_2, k)$ that is connected to the desired high-resolution reference frame by a
motion trajectory as follows:

$$C_{n_1,n_2,k} = \left\{ x(m_1, m_2) : \left| r_k^{(x)}(n_1, n_2) \right| \le \delta_0 \right\}, \quad 0 \le n_1, n_2 \le N-1, \quad k = 1, \ldots, L \qquad (6.39)$$

where

$$r_k^{(x)}(n_1, n_2) = g_k(n_1, n_2) - \sum_{m_1=0}^{M-1}\sum_{m_2=0}^{M-1} x(m_1, m_2)\, h_k(m_1, m_2, n_1, n_2)$$

and $\delta_0$ represents the confidence we have in the observation, set equal to $c\sigma_v$, where
$\sigma_v$ is the standard deviation of the noise and $c \ge 0$ is determined by an appropriate
statistical confidence bound. These sets define high-resolution images that are
consistent with the observed low-resolution frames within a confidence bound that is
proportional to the standard deviation of the observation noise.

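The projection $y(m_1, m_2) = P_{n_1,n_2,k}\{x(m_1, m_2)\}$ of an arbitrary $x(m_1, m_2)$ onto
$C_{n_1,n_2,k}$ is defined as

$$
P_{n_1,n_2,k}\{x(m_1, m_2)\} =
\begin{cases}
x(m_1, m_2) + \dfrac{r_k^{(x)}(n_1, n_2) - \delta_0}{\sum_{o_1}\sum_{o_2} h_k^2(o_1, o_2, n_1, n_2)}\, h_k(m_1, m_2, n_1, n_2) & \text{if } r_k^{(x)}(n_1, n_2) > \delta_0 \\[2ex]
x(m_1, m_2) & \text{if } -\delta_0 \le r_k^{(x)}(n_1, n_2) \le \delta_0 \\[2ex]
x(m_1, m_2) + \dfrac{r_k^{(x)}(n_1, n_2) + \delta_0}{\sum_{o_1}\sum_{o_2} h_k^2(o_1, o_2, n_1, n_2)}\, h_k(m_1, m_2, n_1, n_2) & \text{if } r_k^{(x)}(n_1, n_2) < -\delta_0
\end{cases}
\qquad (6.40)
$$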


Additional constraints, such as amplitude and/or finite support constraints,
can be utilized to improve the results. The amplitude constraint $C_A$ has been
defined in (3.112), and the projection $P_A$ onto the amplitude constraint $C_A$ is given
by (3.113).

Given the above projection operators, an estimate $\hat{s}(m_1, m_2)$ of the
high-resolution image $s(m_1, m_2)$ is obtained by successive projections onto each
observation set iteratively as

$$\hat{s}^{(j+1)}(m_1, m_2) = T_A \prod_{k=1}^{L} \prod_{n_1=0}^{N-1} \prod_{n_2=0}^{N-1} T_{n_1,n_2,k} \left\{ \hat{s}^{(j)}(m_1, m_2) \right\}, \quad 0 \le m_1, m_2 \le M-1, \quad j = 0, 1, \ldots \qquad (6.41)$$

where $T$ denotes the generalized projection operator. An initial estimate $\hat{s}^{(0)}(m_1, m_2)$
of the high-resolution image is computed by interpolating the low-resolution refer- 
ence frame to the desired resolution using bilinear or bicubic interpolation. The 
projections aim to ensure that the reconstructed high-resolution image is consistent 
with each and every pixel of all the observed low-resolution images as depicted in 
Figure 6.26. Excellent super-resolution reconstructions have been reported using
this procedure [Tek 92, Pat 97b].


Figure 6.26 Successive projections onto convex sets.


A few observations about the POCS method are in order: 


1. While certain similarities exist between the POCS iterations and the 
Landweber-type iterations [Tru 85, Ira 91, Ira 93], the POCS method can 
adapt to the amount of observation noise, while the latter generally cannot. 

2. The proposed POCS method can also be applied to shift-varying multi-frame
restoration and standards conversion problems by specifying the input and output
lattices and the shift-varying system PSF $h_k(m_1, m_2, n_1, n_2)$ appropriately (as
stated in Section 3.6.4).

3. The POCS method finds a feasible solution, i.e., a solution consistent with all
available low-resolution observations. Clearly, the more low-resolution observations
(more frames with reliable sub-pixel motion estimates) are available, the
better the high-resolution reconstructed image $\hat{s}(m_1, m_2)$ will be. In general, it
is desirable that $L > M^2/N^2$. Note, however, that the POCS method generates
a reconstructed image with any number $L$ of available frames. The number
$L$ is just an indicator of how large the feasible set of solutions will be. Note that
the size of the feasible set can be further reduced by employing other closed,
convex constraints in the form of statistical or structural image models.
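The projection onto a single observation set and one sweep over all sets can be sketched compactly if the high-resolution image is flattened into a vector. This is a minimal illustration under stated assumptions, not the implementation of [Tek 92, Pat 97b]: each row of the matrix `H` holds the PSF coefficients that link one low-resolution pixel to the high-resolution samples, plain (unrelaxed) projections are used, and the names `project`, `pocs_sweep`, `H`, `g`, and `delta` are hypothetical.

```python
import numpy as np

def project(x, h, g, delta):
    """Project x onto the set {x : |g - h.x| <= delta} for one LR pixel."""
    r = g - h @ x                      # residual of this observation
    energy = h @ h                     # sum of squared PSF coefficients
    if energy == 0.0 or abs(r) <= delta:
        return x                       # x already lies inside the set
    step = (r - delta) if r > delta else (r + delta)
    return x + (step / energy) * h     # move to the nearest set boundary

def pocs_sweep(x, H, g, delta, lo=0.0, hi=255.0):
    """One pass of successive projections over all observation sets,
    followed by an amplitude constraint that clips to [lo, hi]."""
    for h_row, g_val in zip(H, g):
        x = project(x, h_row, g_val, delta)
    return np.clip(x, lo, hi)

rng = np.random.default_rng(0)
H = rng.random((6, 4))                 # toy PSF matrix: 6 LR pixels, 4 HR samples
x_true = rng.uniform(50.0, 200.0, 4)
g = H @ x_true                         # noise-free observations
x0 = np.full(4, 128.0)                 # flat initial estimate
x1 = pocs_sweep(x0, H, g, delta=0.5)
```

Because the true image lies in every set, each sweep cannot move the estimate farther from it; repeating `pocs_sweep` shrinks the residuals toward the bound `delta`.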


There are several variations of the basic POCS solution to the super-resolution
reconstruction problem. Elad and Feuer [Ela 97] proposed a hybrid super-resolution
reconstruction algorithm that combines the benefits of the MAP and POCS methods.
The hybrid approach defines a single optimum solution while enforcing all
convex constraints. Altunbasak et al. [Alt 02] incorporated quantization constraints
for super-resolution reconstruction from compressed video bitstreams.


Super-Resolution without Sub-Pixel Motion Estimation 


Since precise sub-pixel optical flow estimation in the presence of independently 
moving objects and occlusion is an exceedingly difficult problem, the practical appli- 
cability of SR-reconstruction methods is limited to video with global motion, e.g., 
affine camera motion. To overcome this problem, Takeda et al. [Tak 09] have pro- 
posed 3D steering kernel regression (in space-time) for super-resolution, which does 
not require explicit, sub-pixel accurate motion estimation. In this approach, each 
pixel is approximated by a 3D Taylor series, whose coefficients are estimated by 
solving a local weighted least-squares problem. The weights capture the 3D space- 
time orientation in the local neighborhood, which implicitly contains information 
about the local motion of the pixels across time, therefore rendering unnecessary an 
explicit computation of sub-pixel motion estimates. They have developed an iterative 
implementation of this algorithm with rough (pixel accurate) motion compensation 
to accommodate fast and complex motions.


Exercises

Problem Set 6


6.1 Let the horizontal and vertical bandwidths of a video signal with an unknown
global constant velocity be 10° cyc/mm. Suppose it is sampled on a vertically
aligned 2:1 interlaced lattice with the parameters $\Delta_1 = \Delta_2 = 100$ microns, and
$\Delta_t = 1/60$ sec. Find all critical velocities, if any.


6.2 Show that the spatio-temporal impulse response of the filter that performs
"merging" of even and odd fields to form a composite frame is given by

$$h(x_2, t) = \delta(x_2)\,\delta(t) + \delta(x_2)\,\delta(t + T)$$

where $T$ is the field interval. Find the frequency response of this filter. Discuss
the frequency-domain interpretation of merging for stationary and moving image
regions.


6.3 Compare the relative advantages and disadvantages of two-, three-, and
four-frame motion-detection algorithms from interlaced video.


6.4 How would you deal with motion-estimation errors in motion-compensated
up-conversion? How would you compare motion-adaptive filtering and adaptive
motion-compensated filtering in the presence of motion-estimation errors?


6.5 How would you compare the motion-compensated adaptive LMMSE and
AWA filters in the presence of a sudden scene change?


6.6 Discuss the frequency response of the MCMF filter (Section 6.4) as it relates
to the theory of motion-compensated filtering presented in Section 6.1.


6.7 Consider constant-velocity global motion, where the velocity is modeled
as constant during each aperture time (piecewise constant-velocity model).
Let its value during the $i$th aperture time (acquisition of the $i$th frame) be given
by $\mathbf{v}_i = [v_{1,i} \; v_{2,i}]^T$. Show that the Jacobian $J(u_1, u_2)$ in Eqn. (6.31) is equal
to $1/|v_{1,i}|$.


6.8 Suppose we have four low-resolution images that have global translations with
respect to a reference frame with velocities $v_{1,i} = v_{2,i} = 8.25$, 8.5, 9, and 10 in
units of pixels per high-resolution sampling grid for $i = 1, \ldots, 4$, respectively.
Assume that each side of the low-resolution sensor cell is four times that of
the high-resolution cell with rectangular pixel geometry shown in Figure 6.25.
Calculate the discrete-discrete PSF $h_k(m_1, m_2, n_1, n_2)$ in Eqn. (6.35).


6.9 Derive Eqn. (6.38).


6.10 Derive Eqn. (6.40).


6.11 Discuss the relationship between the POCS reconstruction discussed in
Section 6.5.4 and the back-projection iterations presented in [Ira 93].


MATLAB Exercises 


6.1 De-interlacing

Suppose we have a standard-definition interlaced color video stored in a .yuv
file in composite-frame format, where pixels for each frame are ordered as all
Y pixels first, followed by Cb pixels and Cr pixels sequentially. Note that the
chrominance components are 4:2:0 sub-sampled; hence, the Y component is
704×480 pixels, and Cb and Cr are each 352×240 pixels.

a. Extract (separate) the even and odd fields given an interlaced video .yuv
file.

b. Implement the bob filter (intra-averaging) to generate full-size frames
from even and odd fields.

c. Implement the weave filter to generate full-size frames from even and odd
fields.

d. Implement the motion-adaptive bob-and-weave filter to generate full-size
frames from even and odd fields.

Compare results and write your observations about the results.


6.2 AWA Filter

Given a video sequence with N frames:

a. Add zero-mean, white Gaussian noise with variance $\sigma^2$ to each frame.

b. Estimate motion trajectory at each pixel. Use any motion-estimation
method of your choice. Comment on the quality of the motion estimates
as the variance of the noise increases.

c. Implement the AWA filter along the motion trajectories given by (6.25).
How do you select $a > 0$ and $\varepsilon$? Comment on the effect of these parameters
on the performance of the filter as a function of the noise variance $\sigma^2$.


CHAPTER 7


Image Compression 





Compression may be mathematically lossless or lossy. The more the visual-quality 
degradation (loss) that can be tolerated, the higher the compression ratio will be. 
Compression of images without significant loss of perceived quality is possible because 
images contain a high degree of i) spatial redundancy, due to correlation between 
neighboring pixels, ii) spectral redundancy, due to correlation among color compo- 
nents, and iii) psychovisual redundancy, due to what the human eye cannot see. The 
more the redundancy, the higher the achievable compression will be. 


The need for effective data compression is evident in almost all applications where
storage and transmission of digital images are involved. For example, an 8.5×11 in
document scanned at 300 pixels/in with 1 bit/pixel generates 8.4 Mbits of data. A
high-resolution digital still image with 4000 pixels × 3000 lines and 8 bits/pixel per
color is 288 Mbits. This chapter introduces the basics of image compression, and
discusses some commonly used lossless and lossy still-image-compression methods
and standards. Further references are included for those who wish to implement
the covered compression algorithms and standards. Some preliminary concepts on
information theory as well as elements of an image-compression system, including
quantization and entropy coding, are introduced in Section 7.1. Most popular
lossy image-coding algorithms employ the transform-coding paradigm. Section
7.2 covers discrete cosine transform (DCT) based image coding and the International
Organization for Standardization (ISO) JPEG standard for lossy image compression.
Wavelet-transform based lossless and lossy image coding and the ISO JPEG2000
standard are discussed in Section 7.3.
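The raw data sizes quoted above follow from a one-line product. The sketch below reproduces them; the helper name `raw_size_bits` is hypothetical, and the still image is assumed to have three color components.

```python
def raw_size_bits(width, height, bits_per_sample, components=1):
    """Uncompressed size in bits of a width x height image."""
    return width * height * bits_per_sample * components

# 8.5 x 11 in page scanned at 300 pixels/in, 1 bit/pixel
doc_bits = raw_size_bits(int(8.5 * 300), 11 * 300, 1)

# 4000 x 3000 still image, 8 bits/pixel per color, three colors assumed
photo_bits = raw_size_bits(4000, 3000, 8, components=3)

print(doc_bits / 1e6, photo_bits / 1e6)   # sizes in Mbits
```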


7.1 Basics of Image Compression 


This section first summarizes some basic results from information theory that pro- 
vide bounds on the achievable compression ratios and bit-rates, then presents the 
basic elements of a general image-compression system. 


7.1.1 Information Theoretic Concepts 


A source $X$ with an alphabet $A$ is defined as a discrete random process (a sequence of
random variables $X_i$, $i = 1, \ldots$) in the form $X = X_1 X_2 \ldots$, where each random variable
$X_i$ takes a value from the alphabet $A$. In the following, we assume the alphabet
contains $M$, a finite number of symbols, i.e., $A = \{a_1, a_2, \ldots, a_M\}$. A discrete memoryless
source (DMS) is such that successive symbols are statistically independent. It is
completely specified by the probabilities $p(a_i) = p_i$, $i = 1, \ldots, M$, such that $p_1 + \ldots + p_M = 1$.

According to information theory, the information content of a symbol is related 
to the extent that the symbol is unpredictable or unexpected. If a symbol with low 
probability occurs, a larger amount of information is transferred than in the occur- 
rence of a more likely symbol. This quantitative concept of surprise is formally 
expressed by the relation 


$$I(a_i) = \log_2 \frac{1}{p(a_i)} \quad \text{for } a_i \in A \qquad (7.1)$$

where $I(a_i)$ is the information that the symbol $a_i$ with probability $p(a_i)$ carries. The
unit of information is bits when we use logarithm with base-2. Observe that if $p = 1$,
then as expected $I = 0$, and at the other extreme, $I \to \infty$ as $p \to 0$.

In variable-length source coding (VLC), in the case where we assign an indi- 
vidual codeword to each individual symbol, the optimum length of the binary code 
for a symbol is equal to the information (in bits) of the symbol. In practice, the 
probability of occurrence of each symbol is estimated from the histogram of a given 
source or a training set of sources. 




The entropy $H(X)$ of a DMS $X$ with an alphabet $A$ is defined as the average
information per symbol in the source, given by

$$H(X) = \sum_{i=1}^{M} p(a_i) \log_2 \frac{1}{p(a_i)} = -\sum_{i=1}^{M} p(a_i) \log_2 p(a_i) \qquad (7.2)$$


The more skewed the probability distribution of the symbols, the smaller the entropy 
of the source. The entropy is maximized for a flat distribution, i.e., when all symbols 
are equally likely. It follows that a source where some symbols are more likely than 
others has a smaller entropy than another source where all symbols are equally likely. 
Hence, the performance of lossless encoding of a source will be related to the entropy 
of the source. 
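The information of a symbol and the entropy of a DMS are easy to evaluate directly. The following sketch, with hypothetical helper names `information` and `entropy`, compares a flat and a skewed four-symbol distribution.

```python
import math

def information(p):
    """Information in bits carried by a symbol of probability p > 0."""
    return math.log2(1.0 / p)

def entropy(probs):
    """Entropy in bits/symbol of a DMS; zero-probability symbols add nothing."""
    return sum(p * information(p) for p in probs if p > 0)

flat = [0.25, 0.25, 0.25, 0.25]        # all symbols equally likely
skewed = [0.5, 0.25, 0.125, 0.125]     # some symbols more likely

print(entropy(flat), entropy(skewed))  # the flat source has the larger entropy
```

For the skewed source, the symbol informations are 1, 2, 3, and 3 bits, which are also the codeword lengths an ideal variable-length code would assign.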


Example: Entropy of a Raw Image 


Suppose an 8-bit image is taken as a realization of a DMS $X$. The symbols $i$
are gray levels of pixels in the image, and the alphabet $A$ is the set of all
possible gray levels between 0 and 255. Then, the entropy of the image is given by

$$H(X) = -\sum_{i=0}^{255} p(i) \log_2 p(i)$$

where $p(i)$ denotes the relative frequency of occurrence of the gray level $i$ in
the image. Note that the entropy of an image consisting of a single gray level
(constant image) is zero.
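A possible way to estimate this entropy from the gray-level histogram is sketched below. It assumes the image is a NumPy array of type `uint8`; the function name `image_entropy` is hypothetical.

```python
import numpy as np

def image_entropy(img):
    """Entropy in bits/pixel from the gray-level histogram of an 8-bit image."""
    hist = np.bincount(img.ravel(), minlength=256)
    p = hist[hist > 0] / img.size        # relative frequencies of used levels
    return float(-(p * np.log2(p)).sum())

constant = np.full((16, 16), 7, dtype=np.uint8)              # single gray level
ramp = np.arange(256, dtype=np.uint8).reshape(16, 16)        # every level once

h_constant = image_entropy(constant)
h_ramp = image_entropy(ramp)
```

The constant image gives zero entropy, while the image that uses all 256 levels equally often reaches the 8 bits/pixel maximum.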


Next, we present two fundamental theorems, the lossless-coding theorem and 
the source-coding theorem, which are used to assess the performance of lossless cod- 
ing and lossy coding methods, respectively. 


Lossless-Coding Theorem [Sha 48] 


The minimum bit-rate that can be achieved by lossless coding of a DMS $X$ is given by

$$R_{\min} = H(X) + \varepsilon \ \text{bits/symbol}$$

where $R$ is the transmission rate, $H(X)$ is the entropy of the source, and $\varepsilon$ is a positive
quantity that can be made arbitrarily close to zero.

The lossless-coding theorem establishes the lower bound for the bit-rate necessary
to achieve zero coding-decoding error in the case of a DMS. We will introduce
Huffman coding, which can approach this bound for DMS by encoding each symbol
independently, and arithmetic coding, which assigns a single codeword to an
arbitrary length group of input symbols.

In lossy coding, the achievable minimum bit-rate is a function of the distortion 
that is allowed. This relationship between the bit-rate and distortion is given by the 
rate-distortion function [Ber 71] as stated by the source-coding theorem. 


Source-Coding Theorem 


There exists a mapping from the source symbols to codewords such that for a given 
distortion D, R(D) bits/symbol are sufficient to enable source reconstruction with an 
average distortion that is arbitrarily close to D. The actual rate R should obey 


$$R \ge R(D)$$


for fidelity $D$. The function $R(D)$ is called the rate-distortion function. Note that
$R(0) = H(X)$.

A typical rate-distortion function is depicted in Figure 7.1. The rate-distortion 
function can be computed analytically for simple source and distortion models. 
Computer algorithms exist to compute R(D) when analytical methods fail or are 
impractical [Ber 71]. In general, we are interested in designing a compression system
to achieve either the lowest bit-rate for a given distortion or the lowest distortion at 
a given bit-rate. Note that the source-coding theorem does not state how to design 
algorithms to achieve these desired limits. 


[Figure 7.1 axis: Distortion, D, from 0 to D_max]

Figure 7.1 Rate-distortion function.


7.1.2 Elements of Image-Compression Systems


In information theory, the process of data compression by redundancy reduction is 
referred to as source encoding. Images contain two types of redundancy: statistical
(spatial) and psychovisual. Statistical redundancy is present because certain spatial
patterns are more likely than others, whereas psychovisual redundancy originates
from the fact that the human eye is insensitive to certain spatial frequencies. The
block diagram of a source encoder is shown in Figure 7.2. It is composed of the
following blocks:


1. Transformer (T) applies a one-to-one transformation to the input image data. 
The output of the transformer is an image representation that is more amenable 
to efficient compression than the raw image data. Typical transformations are 
linear predictive mapping, which maps the pixel intensities onto a prediction 
error signal by subtracting the predictable part of the pixel intensities; unitary
mappings such as the discrete-cosine transform, which pack the energy of the 
signal to a small number of coefficients; and multi-resolution mappings, such 
as sub-band decompositions and the wavelet transform. 

2. Quantizer (Q) generates a limited number of symbols that can be used in the 
representation of the compressed image. Quantization is a many-to-one mapping
that is irreversible. 

3. Coder (C) assigns a binary codeword to each symbol at the output of the quan- 
tizer. It may employ fixed-length or variable-length codes. 


Different image-compression systems implement different combinations of these 
choices. Image-compression methods can be broadly classified as: 


• Lossless (noiseless) compression methods, which aim to minimize the bit-rate
without any distortion in the image.

• Lossy compression methods, which aim to obtain the best possible fidelity for
a given bit-rate, or to minimize the bit-rate to achieve a given fidelity measure.


[Figure 7.2 signal labels: pixel blocks, real-valued coefficients, finite-alphabet symbols, binary stream]

Figure 7.2 Block diagram of a lossy image-compression system. A lossless compression system
employs an integer transform and no quantization.

The transformation and encoding blocks are lossless. However, quantization is
lossy. Therefore, lossless methods, which only make use of the statistical redundan- 
cies, do not employ a quantizer. In most practical cases, a small degradation in the 
image quality must be allowed to achieve the desired bit-rate. Lossy compression 
methods make use of both the statistical and psychovisual redundancies. In the fol- 
lowing, we discuss quantization and entropy coding. 


7.1.3 Quantization 


Quantization is the process of representing a source consisting of a set of
continuous-valued samples with a finite number of states (also called a finite alphabet). It can
be performed using scalar or vector quantizers. If each sample is quantized
independently, the process is known as scalar quantization. Vector quantization refers
to quantization of a block of samples, represented by a vector, at once, with a finite
number of vector states [Ger 92]. Here, we focus on scalar quantizers since
state-of-the-art image and video compression employs scalar quantization.

A scalar quantizer $Q(\cdot)$ is a function that is defined in terms of a finite set of
decision levels $d_i$ and reconstruction levels $r_i$. The quantized variable, $\hat{s}$, is given by

$$\hat{s} = Q(s) = r_i \quad \text{if } s \in (d_{i-1}, d_i], \quad i = 1, \ldots, L \qquad (7.3)$$

where $L$ is the number of output quantized states. That is, the output of the
quantizer is the reconstruction level $r_i$, if $s$, the value of the sample before quantization,
is within the range $(d_{i-1}, d_i]$. The distance between successive decision
(reconstruction) levels can be equal or unequal, called uniform and non-uniform quantization,
respectively.
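A scalar quantizer defined by decision and reconstruction levels can be sketched with a sorted-array lookup. The function name `scalar_quantize` is hypothetical, and clipping out-of-range samples to the outermost levels is an assumption that the definition above does not address.

```python
import numpy as np

def scalar_quantize(s, d, r):
    """Map each sample in (d[i-1], d[i]] to the reconstruction level r[i-1].

    d holds the L+1 decision levels d_0 < ... < d_L and r the L
    reconstruction levels (stored with zero-based indexing).
    """
    d = np.asarray(d, dtype=float)
    r = np.asarray(r, dtype=float)
    # side="left" returns i such that d[i-1] < s <= d[i]
    i = np.searchsorted(d, s, side="left")
    i = np.clip(i, 1, len(r))            # assumption: clip out-of-range samples
    return r[i - 1]

d = [0.0, 1.0, 2.0, 3.0]                 # decision levels
r = [0.5, 1.5, 2.5]                      # reconstruction levels
s_hat = scalar_quantize(np.array([0.2, 1.0, 1.0001, 3.0]), d, r)
```

Here the sample 1.0 falls in the half-open interval (0, 1] and maps to 0.5, while 1.0001 maps to 1.5.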

The performance of a quantizer is measured by a distortion measure $D$, which
is a function of the quantization error, $e = s - \hat{s}$, that depends on $d_i$ and $r_i$. If we treat
$s$ as a realization of a random variable $S$ with a probability density function (pdf)
$p_S(s)$, the distortion measure may be taken as the mean-square quantization error
$D = E\{(s - \hat{s})^2\}$, where $E\{\cdot\}$ stands for the expectation operator.

Given a distortion measure $D$ and the source pdf $p_S(s)$, there are two optimum
scalar quantizer design methodologies:


1. For a fixed number of levels $L$, find $r_i$ and $d_i$, $i = 0, \ldots, L$, in order to minimize
distortion $D$. Non-uniform quantizers with a fixed number of levels that are
optimal in the mean-square-error sense, $D = E\{(s - r_i)^2\}$, are known as
Lloyd-Max quantizers [Llo 82, Max 60].

2. For a fixed output entropy $H(\hat{s}) = C$, where $C$ is a constant, find $r_i$ and $d_i$,
$i = 0, \ldots, L$, ($L$ is unknown) in order to minimize $D$. Quantizers that minimize
a distortion measure for a constant output entropy are known as
entropy-constrained quantizers [Woo 69].


A detailed discussion of the design of Lloyd-Max quantizers and entropy- 
constrained quantizers can be found in [Gra 98]. Image/video compression stan- 
dards employ uniform quantization with a fixed number of levels, which is a special 
case of Lloyd-Max quantizers. This case is discussed in the following. 


Uniform Quantization 


A quantizer is called a uniform quantizer if the distances between successive
reconstruction levels are equal, i.e.,

$$r_{i+1} - r_i = \theta, \quad i = 1, \ldots, L-1$$

where $\theta$ is a constant called the step-size. Uniform quantizers can be classified as
mid-tread or mid-riser. A mid-tread quantizer, described by

$$Q(s) = \mathrm{sgn}(s) \left\lceil \frac{|s|}{\theta} - \frac{1}{2} \right\rceil \theta = \mathrm{NINT}\!\left[ \frac{s}{\theta} \right] \theta \qquad (7.4a)$$

where $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$ and NINT denotes
the nearest integer round-off, has a zero-valued reconstruction level, while a mid-riser
quantizer described by

$$Q(s) = \left( \left\lceil \frac{s}{\theta} \right\rceil - \frac{1}{2} \right) \theta \qquad (7.4b)$$

has a zero-valued decision level. Mid-tread quantizers should be preferred when the
pdf of the source is symmetric about $s = 0$ and decays for larger values of $|s|$. In
mid-tread quantizers, the decision region around $\hat{s} = 0$ is called the deadzone.
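The difference between the two uniform quantizers is easiest to see on a few samples. The sketch below assumes the ceiling-based forms of the mid-tread and mid-riser rules with step-size `theta`; the function names `midtread` and `midriser` are hypothetical.

```python
import numpy as np

def midtread(s, theta):
    """Uniform mid-tread quantizer: zero is a reconstruction level."""
    s = np.asarray(s, dtype=float)
    return np.sign(s) * np.ceil(np.abs(s) / theta - 0.5) * theta

def midriser(s, theta):
    """Uniform mid-riser quantizer: zero is a decision level."""
    s = np.asarray(s, dtype=float)
    return (np.ceil(s / theta) - 0.5) * theta

samples = [0.2, -0.2, 0.6, -1.4]
tread = midtread(samples, theta=1.0)   # small samples fall in the deadzone
riser = midriser(samples, theta=1.0)   # every output is an odd multiple of theta/2
```

With a unit step-size, the mid-tread quantizer sends both 0.2 and -0.2 to zero, whereas the mid-riser quantizer sends them to 0.5 and -0.5.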


Example: Uniform Quantization in JPEG Image Compression 


In JPEG image compression, the pdf of the source (DCT coefficients) is
modeled by a zero-mean Laplacian distribution, given by

$$p_S(s) = \frac{\alpha}{2}\, e^{-\alpha |s|}$$

Hence, mid-tread uniform quantization given by (7.4a) is employed. The
encoder computes the integer quantization index values

$$k = \mathrm{NINT}\!\left[ \frac{s}{\theta} \right]$$

that are entropy encoded for transmission/storage. The reconstructed values

$$\hat{s} = Q(s) = k\theta$$

are computed by the decoder, given the step-size $\theta$. JPEG employs human
visual system weighted quantization, where a different step-size is used for
each frequency coefficient position. The recommended (default) step-sizes
have been determined by the JPEG committee.
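The encoder and decoder sides of this scheme can be sketched for one 8×8 block of coefficients. The step-size table below is made up for illustration and is not the table recommended by the JPEG committee; the function names are hypothetical, and `np.rint` resolves exact ties to the nearest even integer.

```python
import numpy as np

def quantize(coeffs, steps):
    """Encoder side: integer quantization indices k = NINT(s / theta)."""
    return np.rint(coeffs / steps).astype(int)

def dequantize(indices, steps):
    """Decoder side: reconstructed values k * theta."""
    return indices * steps

# Made-up table: the step-size grows with the frequency position (u, v).
u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
steps = 8.0 + 2.0 * (u + v)

rng = np.random.default_rng(1)
coeffs = rng.laplace(scale=20.0, size=(8, 8))   # stand-in for DCT coefficients
k = quantize(coeffs, steps)
rec = dequantize(k, steps)
```

The reconstruction error at each position is at most half the local step-size, so coarser steps at high frequencies trade accuracy for fewer nonzero indices.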


Example: Uniform Quantization of a Source Uniformly Distributed in [A,B]

If $p(s)$ is uniformly distributed over an interval $[A, B]$, then the uniform quantizer
is the Lloyd-Max quantizer, and for $L$ levels, the step-size is given by

$$\theta = \frac{B-A}{L}$$

Then, using a mid-riser quantizer, the reconstruction levels are given by (7.4b).


Example: Quantization Noise 

Suppose we have a memoryless zero-mean Gaussian source S with variance
$\sigma^2$. Let the distortion measure be the mean-square error. What is the minimum
number of levels, equivalently the rate R in bits/sample, to obtain
40 dB SNR, assuming uniform quantization? We can express the mean-square
quantization noise as

$$D = E\{(s - \hat{s})^2\}$$

Then, the SNR in dB is given by

$$\mathrm{SNR} = 10 \log_{10} \frac{\sigma^2}{D}$$

SNR = 40 dB implies $\sigma^2 / D = 10{,}000$. Substituting this into the rate-distortion
function for a memoryless Gaussian source, given by

$$R(D) = \frac{1}{2} \log_2 \frac{\sigma^2}{D}$$

we obtain $R(D) \approx 6.64$, i.e., 7 bits/sample after rounding up to an integer. Similarly,
we can show that quantization with 8 bits/sample yields approximately 48 dB SNR.
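
The arithmetic of this example is easy to check numerically. The sketch below evaluates the Gaussian rate-distortion function $R(D) = \frac{1}{2}\log_2(\sigma^2/D)$ for a target SNR and, conversely, the SNR reached at a given rate; the function names are hypothetical and chosen only for this illustration.

```python
import math


def gaussian_rate(snr_db):
    """Rate in bits/sample from R(D) = 0.5 * log2(sigma^2 / D), with sigma^2 / D = 10^(SNR/10)."""
    ratio = 10 ** (snr_db / 10)
    return 0.5 * math.log2(ratio)


def snr_db_from_rate(bits):
    """SNR in dB implied by the same function: sigma^2 / D = 2^(2R)."""
    return 10 * math.log10(2 ** (2 * bits))


r40 = gaussian_rate(40)         # fractional rate for 40 dB
bits40 = math.ceil(r40)         # integer number of bits/sample
snr8 = snr_db_from_rate(8)      # SNR at 8 bits/sample
print(round(r40, 2), bits40, round(snr8, 1))
```

Each additional bit raises the SNR by about 6 dB, since $10 \log_{10} 4 \approx 6.02$.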


7.1.4 Symbol Coding 


Symbol coding is the process of assigning a binary string to individual symbols or 
to a block of symbols comprising the source. We start by discussing the simplest 
scheme, which is to assign equal-length codewords to individual symbols or a fixed- 
length block of symbols, known as fixed-length coding. Often, significantly better 
compression can be achieved by assigning shorter-length codewords to more probable
symbols, which is the main principle of entropy coding. As a result, most image-
and video-compression schemes employ entropy coding.


Fixed-Length Coding 

Fixed-length coding assigns equal-length codewords to each symbol in the alphabet
A regardless of their probabilities. If the alphabet has M different symbols (or blocks
of symbols), then the length of the codes is $\lceil \log_2 M \rceil$, the smallest integer greater
than or equal to $\log_2 M$. Two commonly used fixed-length coding schemes are natural
codes and Gray codes, which are shown in Table 7.1 for the case of a four-symbol source.
Notice that in Gray coding, consecutive codewords differ in only one bit position. This
property may provide an advantage in error detection. Gray codes are also better suited
for run-length encoding of bit-planes.


Table 7.1 Fixed-Length Codes for a Four-Symbol Alphabet 


Symbol    Natural Code    Gray Code
$a_1$     00              00
$a_2$     01              01
$a_3$     10              11
$a_4$     11              10

It can easily be shown that fixed-length coding is optimal only when: i) the number
of symbols is equal to a power of 2, and ii) all the symbols are equiprobable. Only
then will the entropy of the source be equal to the average length of the codewords,
which is equal to the length of each codeword in the case of fixed-length coding.
For the example shown in Table 7.1, both the entropy of the source and the average
codeword length are 2 bits, assuming all symbols are equally likely.
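
As a small illustration, the two fixed-length codes of Table 7.1 can be generated programmatically. The sketch below uses the standard binary-reflected Gray code, obtained as `i ^ (i >> 1)`; the function names are hypothetical, and the symbols are simply indexed 0 to M-1.

```python
import math


def natural_code(i, bits):
    """Natural binary code of index i, zero-padded to a fixed length."""
    return format(i, f"0{bits}b")


def gray_code(i, bits):
    """Binary-reflected Gray code of index i, zero-padded to a fixed length."""
    return format(i ^ (i >> 1), f"0{bits}b")


M = 4                                   # four-symbol alphabet, as in Table 7.1
bits = math.ceil(math.log2(M))          # fixed codeword length
table = [(natural_code(i, bits), gray_code(i, bits)) for i in range(M)]

# Entropy of an equiprobable M-symbol source, in bits/symbol.
entropy = -sum((1 / M) * math.log2(1 / M) for _ in range(M))
print(table, entropy)
```

For equiprobable symbols and M a power of 2, the entropy equals the fixed codeword length, which is the optimality condition stated above.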


Entropy Coding 


In most compression applications, some symbols are more probable than others, 
where it would be more advantageous to use entropy coding, which assigns variable- 
length codewords to each symbol. Entropy coding, also known as variable-length 
coding (VLC), assigns codewords in such a way as to minimize the average codeword 
length for the source. This is achieved by assigning shorter codewords to more prob- 
able symbols, which is the fundamental principle of entropy coding. Indeed, the 
goal of the transformation box in Figure 7.2 is to obtain a set of symbols with a skewed
probability distribution to minimize the entropy of the transformed source.

Two popular methods of entropy coding are Huffman coding and arithmetic 
coding, which are introduced in more detail in the next two sub-sections. The first,
Huffman coding, assigns variable-length codes to a fixed-length block of symbols, 
where the block length is typically one, and the length of the codewords is propor- 
tional to the information (in bits) of the respective symbols or block of symbols. The 
latter, arithmetic coding, assigns variable-length codes to a variable-length block of 
symbols. 

Note that the rate R to encode the quantized sample values in the case of fixed-length
coding is given by $R = \lceil \log_2 L \rceil$, where $\lceil x \rceil$ denotes the smallest integer greater
than or equal to $x$, while in the case of entropy coding or VLC, it is given by $R \geq H$,
where $H$ is the entropy of the quantized source, according to the lossless coding theorem.


7.1.5 Huffman Coding 


Huffman coding yields the optimal integer prefix codes given a source with a finite 
number of symbols and their probabilities. In prefix codes, no codeword is a prefix 
of another codeword. Such codes are uniquely decodable since a given binary string 
can only be interpreted in one way. Huffman codes are optimal in the sense that no 
other integer-length VLC can be found to yield a smaller average bit-rate. In fact, 
the average length of Huffman codes per codeword achieves the lower bound, the 
entropy of the source, when the symbol probabilities are all powers of 2. Huffman 
codes can be designed by following a very simple procedure. 

Let X denote a DMS with the alphabet A and the symbol probabilities $p(a_i)$,
$a_i \in A$, $i = 1, \ldots, M$. Obviously, if $M=2$, we must have

$$c(a_1) = 0 \quad \text{and} \quad c(a_2) = 1 \qquad (7.5)$$

where $c(a_i)$ denotes the codeword for the symbol $a_i$, $i=1, 2$. If A has more than
two symbols, the Huffman procedure requires a series of source reduction steps. In 
each step, we find and merge the two symbols with the smallest probabilities, which 
results in a new source with a reduced alphabet. The probability of the new symbol 
in the reduced alphabet is the sum of the probabilities of the two “merged” symbols 
from the previous alphabet. This procedure is continued until we reach a source with 
only two symbols, for which the codeword assignment is given by Eqn. (7.5). Then 
we work backwards toward the original source, each time splitting the codeword of 
the “merged” symbol into two new codewords by appending it with a zero and one, 
respectively. The following examples demonstrate this procedure. 


Example: Symbol Probabilities are Powers of 2 


Let the alphabet A consist of four symbols, shown in Table 7.2. The prob- 
abilities and information of the symbols in the alphabet are also listed in the 
table. Note that all symbol probabilities are powers of 2, and consequently 
the symbols have integer information values. 

The Huffman-coding procedure for this alphabet is given in Table 7.3
and Figure 7.3. The reduced alphabet in step 1 is obtained by merging the
symbols $a_3$ and $a_4$ in the original alphabet that have the lowest two probabilities.
Likewise, the reduced alphabet in step 2 is obtained by merging the two
symbols with the lowest probabilities after step 1. Since the reduced alphabet
in step 2 has only two symbols, we assign the codes 0 and 1 to these symbols
in arbitrary order. Next, we assign codes to the reduced alphabet in step 1.
Recall that the symbol 2 in step 2 is obtained by merging the symbols 2 and
3 in step 1. Thus, we assign codes to symbols 2 and 3 in step 1 by appending
a zero and a one, in arbitrary order, to the code for symbol 2 in step 2.
The appended zero and one are shown in bold fonts in Table 7.3. Finally,
the codes for the original alphabet are obtained in a similar fashion. This
procedure can alternatively be described by the tree diagram shown in
Figure 7.3.


Table 7.2 An Alphabet Where the Symbol Probabilities Are Powers of 2


Symbol    Probability    Information
$a_1$     0.5            1
$a_2$     0.25           2
$a_3$     0.125          3
$a_4$     0.125          3


Table 7.3 Illustration of Alphabet Reduction


Original Alphabet    Reduced Alphabet Step 1    Reduced Alphabet Step 2


Figure 7.3 Tree diagram for Huffman coding.


Observe that in this case, the average codeword length is 



$$R = 0.5 \times 1 + 0.25 \times 2 + 0.125 \times 3 + 0.125 \times 3 = 1.75$$

and the entropy of the source is

$$H = -0.5 \log_2 0.5 - 0.25 \log_2 0.25 - 0.125 \log_2 0.125 - 0.125 \log_2 0.125 = 1.75$$

which is consistent with the result that Huffman coding achieves the entropy
of the source when the symbol probabilities are powers of 2. Next, we present
an example in which the symbol probabilities are not powers of 2.

Example: Symbol Probabilities are Not Powers of 2


When the probabilities of the symbols are not powers of 2, the information 
content of each symbol is a real number, as shown in Table 7.4. 

Since the length of each codeword must be an integer, it is not possible to 
design codewords whose lengths are equal to the information of the respec- 
tive symbols in this case. Huffman-code design for the alphabet in Table 7.4 
is shown in Table 7.5. It can be easily seen that for this example the average 
length of codewords is 2.15 and entropy of the source is 2.07. 
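
The source-reduction procedure can be sketched with a priority queue, which repeatedly merges the two least probable entries. This is a minimal illustration rather than a production coder: the function names are hypothetical, and because the 0/1 assignment at each merge is arbitrary, the individual codewords may differ from those in the tables while the codeword lengths, and hence the average length, remain optimal.

```python
import heapq
import math


def huffman_code(probs):
    """Return {symbol: codeword} for a dict {symbol: probability} with at least two symbols."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, n, {sym: ""}) for n, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)      # two smallest probabilities
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]


def average_length(probs, code):
    return sum(p * len(code[s]) for s, p in probs.items())


def entropy(probs):
    return -sum(p * math.log2(p) for p in probs.values())


table_7_2 = {"a1": 0.5, "a2": 0.25, "a3": 0.125, "a4": 0.125}
table_7_4 = {"a1": 0.4, "a2": 0.25, "a3": 0.15, "a4": 0.15, "a5": 0.05}

code_7_2 = huffman_code(table_7_2)
code_7_4 = huffman_code(table_7_4)
print(average_length(table_7_2, code_7_2), entropy(table_7_2))
print(average_length(table_7_4, code_7_4), entropy(table_7_4))
```

For the first alphabet the average length equals the entropy; for the second it exceeds the entropy because the codeword lengths must be integers.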

Notice that Huffman codes are uniquely decodable, with proper syn- 
chronization, because no codeword is a prefix of another. For example, a 
received binary string 


001101101110000... 


can be decoded uniquely as 


a, a, 4,4, a, 4, a, Ags. 


Table 7.4 An Alphabet Where the Symbol Probabilities Are Not Powers of 2 


Symbol    Probability    Information
$a_1$     0.4            1.32
$a_2$     0.25           2
$a_3$     0.15           2.73
$a_4$     0.15           2.73
$a_5$     0.05           4.32


Table 7.5 Huffman Coding When Probabilities Are Not Powers of 2 


Original Alphabet    Reduced Alphabet Step 1    Reduced Alphabet Step 2    Reduced Alphabet Step 3

Block Coding


Until now, we have discussed scalar coding, where each symbol in the alphabet is 
assigned an individual code. Block coding refers to the case where we do not assign 
a separate code to each symbol, but assign codewords to blocks of L symbols from 
the original alphabet. Of course, this requires building a new block alphabet with all 
possible combinations of the L symbols from the original alphabet and computing 
their respective probabilities. Huffman codes for all possible combinations of the L 
symbols from the original alphabet can be formed using the previously described 
design procedure with the new block alphabet. Thus, Huffman coding may be con- 
sidered a block-coding scheme, where we assign variable-length codes to fixed-length 
(L) blocks of symbols. The case L=1 refers to assigning an individual codeword to 
each symbol of the original alphabet, as shown in the previous examples. It has been 
shown that for sources with memory, the coding efficiency improves as L gets larger, 
although the design of the Huffman codes gets more complicated. 
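
The effect of the block length can be illustrated numerically. The sketch below builds the block alphabet for a hypothetical binary source with probabilities 0.9 and 0.1 and reports the average number of bits per original symbol. For simplicity it assumes a memoryless source, so block probabilities are products of symbol probabilities; a source with memory would require the joint probabilities of the blocks instead. The function names are illustrative only.

```python
import heapq
import itertools
import math


def huffman_lengths(probs):
    """Codeword lengths of a Huffman code for a list of probabilities."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tick = len(probs)
    while len(heap) > 1:
        p0, _, m0 = heapq.heappop(heap)
        p1, _, m1 = heapq.heappop(heap)
        for i in m0 + m1:                # every merged leaf moves one level deeper
            lengths[i] += 1
        heapq.heappush(heap, (p0 + p1, tick, m0 + m1))
        tick += 1
    return lengths


def bits_per_symbol(symbol_probs, block_len):
    """Average Huffman code length per original symbol for blocks of block_len symbols."""
    blocks = [math.prod(c) for c in itertools.product(symbol_probs, repeat=block_len)]
    lengths = huffman_lengths(blocks)
    return sum(p * n for p, n in zip(blocks, lengths)) / block_len


source = [0.9, 0.1]                      # hypothetical skewed binary source
source_entropy = -sum(p * math.log2(p) for p in source)
rates = {L: bits_per_symbol(source, L) for L in (1, 2, 3)}
print(rates, source_entropy)
```

The block alphabet grows as $M^L$, which is the design complexity referred to above.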


7.1.6 Arithmetic Coding 


In arithmetic coding, a one-to-one correspondence between the symbols of an alpha- 
bet A and the codewords does not exist. Instead, arithmetic coding assigns a single 
variable-length code to a source X, composed of N symbols from the alphabet, where 
N is variable. The distinction between arithmetic coding and block Huffman coding 
is that in arithmetic coding the length of the input sequence, i.e., the block of symbols 
for which a single codeword is assigned, is variable. Thus, arithmetic coding assigns 
variable-length codewords to variable-length blocks of symbols. Because arithmetic 
coding does not require assignment of integer-length codes to fixed-length blocks of 
symbols, in theory it can asymptotically achieve the lower bound established by the 
lossless-coding theorem. 

Arithmetic coding associates a given realization of X, $x=\{x_1,\ldots,x_N\}$, with a
sub-interval of [0,1) whose length equals the probability of the sequence $p(x)$. The
encoder processes the input stream of symbols one by one, starting with $N=1$,
where the length of the sub-interval associated with the sequence gets smaller as $N$
increases. Bits are sequentially sent to the channel starting from the most-significant 
bit toward the least-significant bit as they are determined according to the proce- 
dure presented in an algorithmic form in the following. At the end of the transmis- 
sion, the transmitted bitstream is a uniquely decodable codeword representing the 
source, which is a binary number pointing to the sub-interval associated with this 
sequence. 

Procedure


Consider an alphabet A with M symbols $a_i$, $i=1,\ldots,M$, with probabilities $p(a_i) = p_i$,
such that $p_1 + \cdots + p_M = 1$. We start by assigning each individual symbol in the
alphabet a sub-interval, within 0 to 1, whose length is equal to its probability. It is assumed
that this assignment is known to the decoder.


1. If the first input symbol $x_1 = a_i$, $i = 1, \ldots, M$, then define the initial
sub-interval as $I_1 = [l_1, r_1) = [P_{i-1}, P_{i-1} + p_i)$, where $P_{i-1} = p_1 + \cdots + p_{i-1}$
is the cumulative probability of the symbols preceding $a_i$ and $P_0 = 0$. Set $n = 1$,
$L = l_1$, $R = r_1$, and $d = r_1 - l_1$.

2. Obtain the binary expansions of L and R as

$$L = \sum_{k=1}^{\infty} u_k 2^{-k} \quad \text{and} \quad R = \sum_{k=1}^{\infty} v_k 2^{-k} \qquad (7.6)$$

where $u_k$ and $v_k$ are 0 or 1. Compare $u_1$ and $v_1$.

a. If they are not the same, send nothing to the channel, and go to step 3.

b. If $u_1 = v_1$, then send the binary symbol $u_1$, and compare $u_2$ and $v_2$. If they
are not the same, go to step 3.

c. If $u_2 = v_2$, also send the binary symbol $u_2$, and compare $u_3$ and $v_3$, and so on,
until the next two corresponding binary symbols do not match, at which
time go to step 3.

3. Increment n, and read the next symbol. If the nth input symbol $x_n = a_i$, then
sub-divide the interval from the previous step as

$$I_n = [l_n, r_n) = [\, l_{n-1} + P_{i-1}\, d,\; l_{n-1} + (P_{i-1} + p_i)\, d \,)$$

Set $L = l_n$, $R = r_n$, and $d = r_n - l_n$, and go to step 2.


Note that the decoder may decode one binary symbol into several source sym- 
bols, or it may require several binary symbols before it can decode one or more 
source symbols. The operations of the encoder and decoder are illustrated by the 
following example. 

Example: Arithmetic Coding 


Let's determine an arithmetic code to represent a sequence of symbols,

$a_2\; a_1\; a_3 \ldots$

from the source shown in Table 7.3. Because we have four symbols in the
alphabet, the interval between 0 and 1 is initially sub-divided into four,
where the lengths of the sub-intervals are equal to 0.5, 0.25, 0.125, and
0.125, respectively. This is depicted in Figure 7.4.

Symbol         $a_1$     $a_2$     $a_3$      $a_4$
Decimal    0        0.5       0.75       0.875      1
Binary     0.0      0.1       0.11       0.111      1.0
Decimal    0.5      0.625     0.6875     0.71875    0.75
Binary     0.1      0.101     0.1011     0.10111    0.11
Decimal    0.5      0.5625    0.59375    0.609375   0.625
Binary     0.1      0.1001    0.10011    0.100111   0.101


Figure 7.4 Illustration of the concept of arithmetic coding.

The first symbol defines the initial interval as $I_1=[0.5,0.75)$, where the
binary representations of the left and right boundaries are $L=2^{-1}=0.1$ and
$R=2^{-1}+2^{-2}=0.11$, respectively. According to step 2, $u_1=v_1=1$; thus,
“1” is sent to the channel. Noting that $u_2=0$ and $v_2=1$, we read the second
symbol, $a_1$. Step 3 indicates that $I_2=[0.5,0.625)$, with $L=0.100$ and
$R=0.101$. Now that $u_2=v_2=0$, we send “0” to the channel. However,
$u_3=0$ and $v_3=1$, so we read the third symbol, $a_3$. It can be easily seen that
$I_3=[0.59375, 0.609375)$, with $L=0.10011$ and $R=0.100111$. Note that
$u_3=v_3=0$, $u_4=v_4=1$, and $u_5=v_5=1$, but $u_6=0$ and $v_6=1$. At this stage, we
send “011” to the channel, and read the next symbol. A reserved symbol usually
signals the end of a sequence.
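
The encoding steps of this example can be reproduced with exact rational arithmetic. The following is a sketch under stated assumptions, not a practical coder: it uses Python's `fractions.Fraction` instead of the finite-precision scaling used in real implementations, it omits the end-of-sequence symbol, and the function name and 1-based symbol indexing are choices made only for this illustration. Two boundaries agree in their first k binary digits exactly when the integer parts of $2^k L$ and $2^k R$ are equal, which is how the comparison in step 2 is carried out here.

```python
from fractions import Fraction


def arithmetic_encode(symbols, probs):
    """Return (bits sent so far, final interval) for 1-based symbol indices."""
    cum = [Fraction(0)]                      # cumulative probabilities P_0, P_1, ...
    for p in probs:
        cum.append(cum[-1] + p)
    lo, d = Fraction(0), Fraction(1)         # current left boundary and interval length
    sent = ""
    for i in symbols:
        lo = lo + cum[i - 1] * d             # new left boundary
        d = probs[i - 1] * d                 # new interval length
        hi = lo + d
        k = len(sent)                        # digits already sent still agree
        while int(lo * 2 ** (k + 1)) == int(hi * 2 ** (k + 1)):
            k += 1
            sent += str(int(lo * 2 ** k) % 2)   # k-th binary digit shared by L and R
    return sent, (lo, lo + d)


PROBS = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
bits, interval = arithmetic_encode([2, 1, 3], PROBS)   # the sequence a2, a1, a3
print(bits, interval)
```

The bits emitted after each symbol are "1", then "0", then "011", in agreement with the walk-through above.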

Let’s now briefly look at how the decoder operates, which is illustrated 
in Table 7.6. The first bit restricts the interval to [0.5,1). However, three 
symbols are within this range; thus, the first bit does not contain sufficient 
information. After receiving the second bit, we have “10,” which points to 
the interval [0.5,0.75). All possible combinations of two symbols pointing 
to this range start with $a_2$. Hence, we can now decode the first symbol as $a_2$.
The information that becomes available after the receipt of each bit is sum- 
marized in Table 7.6. 

Although we assumed fixed-symbol probabilities above, arithmetic cod- 
ing allows adapting probability tables after encoding each symbol, which is 
called adaptive arithmetic coding [Pen 88]. The updates must be computed in 
a “causal” manner that can be duplicated by the decoder so the encoder and 
decoder remain in synch at all times. 

Table 7.6 Operation of the Decoder


Received Bit    Interval             Decoded Symbol
1               [0.5, 1)             -
0               [0.5, 0.75)          $a_2$
0               [0.5, 0.625)         $a_1$
1               [0.5625, 0.625)      -
1               [0.59375, 0.625)     -
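
The decoder's bookkeeping can be sketched in the same spirit. After each received bit, the bits seen so far confine the code value to a dyadic interval; a source symbol is decoded as soon as that interval lies entirely inside one of the current symbol sub-intervals. This is an illustrative sketch with exact fractions and hypothetical names, assuming every symbol probability is strictly less than 1 and ignoring end-of-sequence handling.

```python
from fractions import Fraction


def decode_trace(bits, probs):
    """Return one (bit, dyadic interval, decoded 1-based symbols) row per received bit."""
    cum = [Fraction(0)]
    for p in probs:
        cum.append(cum[-1] + p)
    lo, d = Fraction(0), Fraction(1)           # interval of the symbols decoded so far
    b_lo, width = Fraction(0), Fraction(1)     # interval implied by the received bits
    rows = []
    for bit in bits:
        width /= 2
        if bit == "1":
            b_lo += width
        b_hi = b_lo + width
        decoded = []
        found = True
        while found:                           # one bit may release several symbols
            found = False
            for i in range(len(probs)):
                s_lo = lo + cum[i] * d
                s_hi = lo + cum[i + 1] * d
                if s_lo <= b_lo and b_hi <= s_hi:
                    decoded.append(i + 1)
                    lo, d = s_lo, s_hi - s_lo
                    found = True
                    break
        rows.append((bit, (b_lo, b_hi), decoded))
    return rows


PROBS = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
trace = decode_trace("10011", PROBS)
for row in trace:
    print(row)
```

For the bitstream "10011" the rows follow the pattern of Table 7.6: nothing is decoded after the first bit, one symbol after each of the next two bits, and nothing after the last two.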


In practice, two factors cause the performance of the arithmetic encoder to fall 
short of the theoretical bound: the use of finite-precision arithmetic and the addi- 
tion of an end-of-message indicator. Practical implementations of the arithmetic 
coder overcome the precision problem by using a scaling and a rounding strategy 
[Wit 87]. 


7.2 Discrete-Cosine Transform Coding and JPEG 


Transform coding, developed more than three decades ago, has proven to be an 
effective image-compression scheme, and is the basis of most world standards for 
lossy compression to date. A basic transform coder segments the image into small 
blocks. Each block undergoes a 2D orthogonal transformation to produce an array of 
transform coefficients. Next, the transform coefficients are quantized and coded. The 
coefficients having the highest energy over all blocks are most finely quantized, and 
those with the least energy are quantized coarsely or simply truncated. The encoder 
treats the quantized coefficients as symbols that are then entropy (variable-length) 
coded. The decoder reconstructs the pixel intensities from the received bitstream 
following the inverse operations on a block-by-block basis. The block diagrams of a 
transform encoder and decoder are shown in Figure 7.5. 

The best transformation for the purposes of effective quantization should pro- 
duce uncorrelated coefficients, and pack the maximum amount of energy (variance) 
into the smallest number of coefficients. The first property justifies the use of scalar 
quantization. The second property is desirable because we would like to discard as 
many coefficients as possible without seriously affecting image quality. The transfor- 
mation that satisfies both properties is the Karhunen—Loeve transformation (KLT). 
Despite its favorable theoretical properties, the KLT is not used in practice, because: 
i) its basis functions depend on the covariance matrix of the image, hence they have 
to be recomputed and transmitted for every image, ii) perfect decorrelation is not
possible, since images can rarely be modeled as realizations of homogeneous random
fields, and iii) there are no fast algorithms for its implementation. After many
considerations, the discrete-cosine transform (DCT), an orthonormal transform with
data-independent basis functions, has been found to be the most effective with a
performance close to that of the KLT [Net 88, Rab 91].


Figure 7.5 Block diagram for transform coding: (a) encoder and (b) decoder.


7.2.1 Discrete-Cosine Transform 


The DCT is the most widely used transformation in transform coding. It is an orthog- 
onal transform that has a fixed set of (image-independent) basis functions, an efficient 
algorithm for its computation, and good energy compaction and correlation-reduction 
properties [Rao 90]. Ahmed et al. [Ahm 74] first noticed that the KLT basis func- 
tions of a first-order Markov image closely resemble those of the DCT. They become 
identical as the correlation between adjacent pixels approaches one. 

The DCT belongs to the family of discrete-trigonometric transforms, which has 
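16 members [Mar 94]. The type-2 DCT of an $N \times N$ block is defined as

$$S(k_1,k_2) = \frac{2}{N}\, C(k_1)\, C(k_2) \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} s(n_1,n_2) \cos\frac{\pi(2n_1+1)k_1}{2N} \cos\frac{\pi(2n_2+1)k_2}{2N}$$

where $k_1, k_2, n_1, n_2 = 0, 1, \ldots, N-1$, and

$$C(k) = \begin{cases} \dfrac{1}{\sqrt{2}} & k = 0 \\ 1 & \text{otherwise} \end{cases}$$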


A significant factor in transform coding is the block size. The most popular sizes
are $8 \times 8$ and $16 \times 16$, both powers of 2 for computational reasons. It is no surprise
that the DCT is closely related to the DFT. In particular, an $N \times N$ DCT of $s(n_1,n_2)$
can be expressed in terms of a $2N \times 2N$ DFT of its even-symmetric extension, which
leads to a fast computational algorithm (see Section 1.2.4).

Using the separability of the DFT and one of several fast Fourier transform
(FFT) algorithms, it is possible to compute an $N \times N$ DCT using $O(2N^2 \log_2 N)$
operations instead of $O(N^4)$, where $O(\cdot)$ stands for “order of.” In addition, because
of the even-symmetric extension process, no artificial discontinuities are introduced 
at the block boundaries, unlike the DFT, resulting in superior energy compaction. 
The fact that the computation of the DCT requires only real arithmetic facilitates its 
hardware implementation. As a result, DCT is widely available in special-purpose 
single-chip VLSI hardware, which makes it attractive for real-time use. 
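
The separability mentioned above can be illustrated with a direct, unoptimized sketch that applies a 1D DCT matrix along the rows and columns of a block. It is not the FFT-based fast algorithm: this row-column form costs on the order of $N^3$ operations. The helper names are hypothetical, and the orthonormal scaling used here is an assumption; a different leading constant in the definition only rescales all coefficients.

```python
import math


def dct_matrix(n):
    """Orthonormal type-2 DCT matrix: entry [k][m] weights sample m in coefficient k."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        rows.append([scale * math.cos(math.pi * (2 * m + 1) * k / (2 * n))
                     for m in range(n)])
    return rows


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def dct2(block):
    """Separable 2D DCT: C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2D DCT: C^T * coeffs * C (valid because C is orthonormal)."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


flat = [[40.0] * 8 for _ in range(8)]                 # constant block after level shift
ramp = [[float(r + 2 * c) for c in range(8)] for r in range(8)]
flat_dct = dct2(flat)
ramp_back = idct2(dct2(ramp))
print(round(flat_dct[0][0], 6))                       # all energy ends up in the DC term
```

A constant block maps to a single non-zero (DC) coefficient, the extreme case of energy compaction.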

The energy-compaction property of the $8 \times 8$ DCT is illustrated in the following
example.


Example: Energy-Compaction Property of DCT

An $8 \times 8$ block of pixels from the 7th frame of the Mobile and Calendar sequence is:





The DCT coefficients (nearest integer), after subtracting 128 from each pixel 
intensity, are: 

Observe that the high-frequency coefficients (around the lower-right corner)
are much smaller than the low-frequency coefficients around (0,0) (at the
upper-left corner).


7.2.2 ISO JPEG Standard 


The JPEG standard describes a family of image-compression techniques for
continuous-tone (gray-scale or color) still images. The JPEG baseline algorithm features
lossy compression based on transform coding to remove statistical and psychovisual
redundancy. Work toward the JPEG standard started in March 1986. In January
1988, JPEG reached consensus that the adaptive DCT approach should be the basis 
for the standard. The JPEG committee successfully finalized the international stan- 
dard in July 1992 [Wal 91, Pen 93]. The JPEG standard supports: 


1. Resolution independence: Arbitrary source resolutions can be handled. Images 
whose dimensions are not multiples of 8 are internally padded to multiples of 8 
in DCT-based modes of operation. 

2. Precision: DCT modes of operation are restricted to 8 and 12 bits/sample preci- 
sion. For lossless coding the precision can be from 2 to 16 bits/sample, although 
JBIG has been found to perform better below 4 to 5 bits/sample. 

3. No absolute bit-rate targets: The bit-rate/quality tradeoff is controlled primarily 
by the quantization matrix (see Section 7.2.3). 

4. Luminance-chrominance separability: The ability exists to recover a luminance- 
only image from luminance-chrominance encoded images without always hav- 
ing to decode chrominance. 


JPEG datastreams are defined in terms of what a JPEG decoder needs in order 
to decompress the datastream. No particular file format, spatial resolution, or 
color space model is specified as part of the standard. However, JPEG includes a 
minimal recommended file format, JPEG File Interchange Format (JFIF), which 
enables JPEG bit-streams to be exchanged between a wide variety of platforms 
and applications. In addition, several commonly available image-file formats 
are JPEG-compatible. In other words, the viewer or the application programs 
must recognize the specific file format in addition to being able to decode JPEG- 
compressed images. 

JPEG provides four modes of operation: sequential (baseline), hierarchical,
progressive, and lossless. The first three are discussed below. A JPEG-compatible
product must support at least the baseline mode.

Baseline Algorithm


The JPEG baseline mode is inspired by the scene adaptive coder [Che 84]. The main 
steps of the baseline algorithm can be summarized as: 


1. DCT computation: The image is first sub-divided into $8 \times 8$ blocks. Each pixel
is level-shifted by subtracting $2^{n-1}$, where $2^n$ is the maximum number of gray
levels. That is, for 8-bit images we subtract 128 from each pixel in an attempt
to remove the DC level of each block. The 2D-DCT of each block is then computed.
In the baseline system, the input and output data precision is limited to
8 bits, whereas the quantized DCT values are restricted to 11 bits.

2. Quantization of the DCT coefficients: The DCT coefficients are threshold- 
coded using a quantization matrix and then reordered using zigzag scanning to 
form a 1D sequence of quantized coefficients. The quantization matrix can be 
scaled to provide a variety of compression levels. The entries of the quantization 
matrix are usually determined according to psychovisual considerations, which 
are discussed below. 

3. Variable-length code (VLC) assignment: The non-zero AC coefficients are 
Huffman-coded using a VLC code that defines the value of the coefficient and 
the number of preceding zeros. Standard VLC tables are specified. The DC 
coefficient of each block is differential pulse-code modulation (DPCM)-coded 
relative to the DC coefficient of the previous block. 


Color 


JPEG uses a standard color space (ITU-R 601-1). It transforms RGB images into a 
luminance-chrominance space, known as the Y-Cr-Cb space, defined by 


$$Y = 0.3R + 0.6G + 0.1B$$

$$Cr = \frac{B - Y}{2} + 0.5$$

$$Cb = \frac{R - Y}{1.6} + 0.5$$


Because the human eye is relatively insensitive to the high-frequency content of
the chrominance channels Cr and Cb (see Figure 3.2(b)), they are sub-sampled by 2
in both directions. This is illustrated in Figure 7.6, where the chrominance channels
contain half as many lines and pixels per line compared to the luminance channel.

Figure 7.6 Macroblock formation.


JPEG orders the pixels of a color image as either non-interleaved (three separate
scans) or interleaved (a single scan). Referring to Figure 7.6, the non-interleaved
ordering is given by


Scan 1: Y1, Y2, Y3, ..., Y16


Scan 2: Cr1, Cr2, Cr3, Cr4


Scan 3: Cb1, Cb2, Cb3, Cb4


whereas the interleaved ordering becomes

Y1, Y2, Y3, Y4, Cr1, Cb1, Y5, Y6, Y7, Y8, Cr2, Cb2, ...


Interleaving makes it possible to decompress the image and convert from 
luminance-chrominance representation to RGB for display with a minimum of 
intermediate buffering. For interleaved data, the DCT blocks are ordered according 
to the parameters specified in the frame and scan headers. 


Psychovisual Aspects 


In order to exploit the psychovisual redundancy, JPEG incorporates characteris- 
tics of the human visual system through specification of quantization matrices. It 
is well known that the frequency response of the human visual system drops off 
with increasing spatial frequency. Furthermore, this drop-off is faster in chromi- 
nance channels, which is demonstrated by the contrast-sensitivity function depicted 
in Figure 2.4(b). It shows that small variations in intensity are more visible in slowly 
varying regions than in busier ones, and they are also more visible in the luminance 
components compared to the chrominance components. 

To this effect, JPEG allows specification of two quantization matrices, one for the
luminance and another for the two chrominance channels, to allocate more bits to
the representation of coefficients that are visually more significant. Tables 7.7 (a)
and (b) show JPEG-specified default quantization matrices for the luminance and
chrominance channels, respectively. The elements of these matrices are based on the
visibility of individual $8 \times 8$ DCT basis functions with a viewing distance equal to six
times the screen width. The basis functions are viewed with a luminance resolution
of 720 pixels × 576 lines and a chrominance resolution of 360 × 576. The matrices
suggest that DCT coefficients corresponding to basis images with low visibility can
be more coarsely quantized.


Example: Demonstration of the JPEG Baseline Algorithm 


Table 7.7 (a) Quantization Table for the Luminance Channel and (b) Quantization Table for the
Chrominance Channel


(a)

16   11   10   16   24   40   51   61
12   12   14   19   26   58   60   55
14   13   16   24   40   57   69   56
14   17   22   29   51   87   80   62
18   22   37   56   68  109  103   77
24   35   55   64   81  104  113   92
49   64   78   87  103  121  120  101
72   92   95   98  112  100  103   99

(b)

17   18   24   47   99   99   99   99
18   21   26   66   99   99   99   99
24   26   56   99   99   99   99   99
47   66   99   99   99   99   99   99
99   99   99   99   99   99   99   99
99   99   99   99   99   99   99   99
99   99   99   99   99   99   99   99
99   99   99   99   99   99   99   99


The JPEG baseline algorithm is applied to the $8 \times 8$ luminance block given
in Section 7.2.1. The gray levels are first shifted by −128 (assuming that the
original is an 8-bit image) and then a forward DCT is applied. The DCT
coefficients divided by the quantization matrix, shown in Table 7.7(a), have
been found as





Following a zigzag scanning (depicted in Figure 7.7) of these coefficients,
the 1D coefficient sequence can be expressed as


20, 5, −3, −1, −2, −3, 1, 1, −1, −1, 0, 0, 1, 2, 3, −2, 1, 1,
0, 0, 0, 0, 0, 0, 1, 1, 0, 1, EOB


where EOB denotes the end of the block (i.e., all following coefficients are 
zero.) 

The DC coefficients are coded using DPCM, as depicted in Figure 7.8.
That is, the difference between the DC coefficient of the present block and
that of the previously encoded block is coded. (Assume the DC coefficient of
the previous block was 29, so the difference is −9.) The AC coefficients are
mapped into symbols that are in the form of (RUN, LEVEL) pairs, given by


(0, 5), (0, −3), (0, −1), (0, −2), (0, −3), (0, 1), (0, 1), (0, −1), (0, −1),
(2, 1), (0, 2), (0, 3), (0, −2), (0, 1), (0, 1), (6, 1), (0, 1), (1, 1), EOB


where RUN indicates the number of zeros preceding the non-zero coefficient
value LEVEL in the zigzag-scanned 1D coefficient sequence.


Figure 7.7 Zigzag scan order.


Figure 7.8 Operation of JPEG baseline (a) encoder and (b) decoder.

The codewords for these symbols can be found according to the tables: 
JPEG coefficient coding categories, JPEG default DC codes (luminance/ 
chrominance), and JPEG default AC codes (luminance/chrominance) [Gon 
07, Pen 93]. For example, the DC difference −9 falls within the DC difference
category 4. The proper base (default) code for category 4 is 101 (a 3-bit
code), while the total length of a completely encoded category 4 coefficient is
7 bits. The remaining 4 bits will be the least-significant bits (LSBs) of the
difference value. Each default AC Huffman codeword depends on the number
of zero-valued coefficients preceding the non-zero coefficient to be coded as 
well as the magnitude category of the coefficient. 
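
The symbol-formation steps of this example can be sketched as follows. The code generates the zigzag scan order, converts a zigzag-scanned AC sequence into (RUN, LEVEL) pairs, and computes the DC difference category as the number of bits needed for the magnitude of the difference. It is an illustration with hypothetical function names and stops short of the actual Huffman table lookup; the special handling that JPEG applies to runs longer than 15 zeros is also omitted.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zigzag scan order."""
    def key(rc):
        r, c = rc
        d = r + c                          # index of the anti-diagonal
        return (d, r if d % 2 else c)      # odd diagonals run downward, even ones upward
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def run_level_pairs(ac):
    """Map zigzag-scanned AC coefficients to (RUN, LEVEL) pairs followed by 'EOB'."""
    last = max((i for i, v in enumerate(ac) if v != 0), default=-1)
    pairs, run = [], 0
    for v in ac[:last + 1]:
        if v == 0:
            run += 1                       # count zeros preceding the next non-zero value
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")                    # all remaining coefficients are zero
    return pairs


def dc_category(diff):
    """DC difference category: number of bits in the magnitude of the difference."""
    return abs(diff).bit_length()


# AC part of the zigzag-scanned sequence of the example, padded with zeros to 63 values.
ac = [5, -3, -1, -2, -3, 1, 1, -1, -1, 0, 0, 1, 2, 3, -2, 1, 1,
      0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
ac += [0] * (63 - len(ac))

pairs = run_level_pairs(ac)
category = dc_category(20 - 29)            # current DC 20, previous DC 29
print(zigzag_order()[:6])
print(pairs)
print(category)
```

Each (RUN, LEVEL) pair is then looked up in the AC code tables by its run length and magnitude category.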

The decoder implements the inverse operations. That is, the received 
coefficients are first multiplied by the same quantization matrix to obtain 

Performing an inverse DCT, and adding 128 to each element, we find
the reconstructed block





The reconstruction error is as large as ±25 gray levels, as can be seen by
comparing the result with the original block shown in Section 7.2.1. The
compression ratio, and hence the quality of the reconstructed image, is controlled
by scaling the quantization matrix (see Section 7.2.3).


JPEG Progressive 


The progressive DCT mode refers to encoding of the DCT coefficients in mul- 
tiple passes, where a portion of the quantized DCT coefficients is transmitted in 
each pass. Two complementary procedures, corresponding to different groupings of 
the DCT coefficients, have been defined for progressive transmission: the spectral 
selection and successive approximation. An organization of the DCT coefficients for 
progressive transmission is depicted in Figure 7.9, where MSB and LSB denote the 
most-significant bit and the least-significant bit, respectively. 

In the spectral selection process, the DCT coefficients are ordered into spectral 
bands where the lower-frequency bands are encoded and sent first. For example, 
the DC coefficient of each block is sent in the initial transmission, yielding a rather 
blocky first image at the receiver. The image quality is usually acceptable after the first 
five AC coefficients are also transmitted. When all the DCT coefficients are eventu- 
ally coded and transmitted, the image quality is the same as that of the sequential 
algorithm. 

In the successive approximation method, the DCT coefficients are first sent with
lower precision, and then refined in successive passes. The DC coefficient of each
block is sent first with full precision to avoid mean level mismatch. The AC
coefficients may be transmitted starting with the MSB plane. Successive approximation
usually gives better-quality images at lower bit-rates. The two procedures may be
intermixed by using spectral selection within successive approximations.

Figure 7.9 Arrangement of the DCT coefficients for progressive transmission.
(Horizontal axis: zigzag scan order, from the DC coefficient at index 0 to index 63;
labeled grouping: spectral selection.)
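The two groupings can be illustrated with a minimal sketch on one block of zigzag-ordered quantized coefficients. The band boundaries, the number of deferred bit-planes (`low_bits`), and all function names are assumptions chosen for illustration; the actual scan structure is chosen by the encoder, and the sketch carries signs separately rather than modeling the bitstream syntax.

```python
def spectral_selection(zz, bands=((0, 0), (1, 5), (6, 63))):
    """Split zigzag-ordered coefficients into bands sent in separate passes."""
    return [zz[lo:hi + 1] for lo, hi in bands]


def successive_approximation(coeffs, low_bits=2):
    """Coarse magnitudes first (low_bits LSBs dropped), then one bit-plane per pass."""
    signs = [-1 if c < 0 else 1 for c in coeffs]
    coarse = [abs(c) >> low_bits for c in coeffs]
    planes = [[(abs(c) >> b) & 1 for c in coeffs]
              for b in range(low_bits - 1, -1, -1)]
    return signs, coarse, planes


def reconstruct(signs, coarse, planes, low_bits=2):
    """Rebuild coefficients from the coarse pass plus any received refinement planes."""
    mags = [m << low_bits for m in coarse]
    for i, plane in enumerate(planes):
        b = low_bits - 1 - i
        mags = [m | (bit << b) for m, bit in zip(mags, plane)]
    return [s * m for s, m in zip(signs, mags)]


if __name__ == "__main__":
    zz = [-26, -3, 0, -3, -2, -6, 2, -4, 1, -3] + [0] * 54
    print([len(band) for band in spectral_selection(zz)])
    signs, coarse, planes = successive_approximation(zz[1:])
    print(reconstruct(signs, coarse, [])[:5])      # coarse pass only
    print(reconstruct(signs, coarse, planes)[:5])  # all refinement passes
```

Decoding with no refinement planes gives a lower-precision version of the AC coefficients, and each additional plane halves the effective quantization step until the sequential result is reached.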


JPEG Hierarchical 


The hierarchical mode of operation employs the concept of pyramid coding [Bur 
83] and may be considered as a special case of the progressive transmission, with 
increasing spatial resolution between the progressive stages. Multi-resolution image 
representation, in the form of the Gaussian pyramids depicted in Figure 3.12, was 
discussed in Section 3.2.3. 

In the first stage, the lowest-resolution image (top layer of the Gaussian pyramid) 
is encoded using one of the sequential or progressive JPEG modes. The decoded 
output of each stage is then interpolated to form the prediction for the next stage. 
The bilinear interpolation filter that doubles the horizontal and vertical resolution is 
specified in the standard. In the next stage, the difference between the actual second 
layer in the Gaussian pyramid and the up-sampled first-layer image is encoded and 
transmitted. The procedure continues until the residual of the highest-resolution 
image is encoded and transmitted. In the hierarchical mode of operation, the image 
quality at extremely low bit-rates surpasses any of the other JPEG modes, but this 
is achieved at the expense of a higher bit rate at the completion of the progression. 


7.2.3 Encoder Control and Compression Artifacts 


Since JPEG compression quantizes DCT coefficients, there is always a loss of fidel- 
ity when an image is encoded and then decoded. If sufficiently fine quantization is 
used, the decoded pictures will be visually lossless (pixel differences will not be visible 
by humans) but the bit-rate (bits/pixel) or filesize will be larger. On the other hand, 
if coarser quantization is employed then the encoding/decoding error will manifest 
itself as visible compression artifacts. These artifacts are called ringing and blocking 
(see Section 2.5.1). 

In JPEG, the tradeoff between decoded image quality and bit-rate (filesize) 
can be controlled by scaling the default quantization matrix, which is given in 
Table 7.7. Scaling the quantization matrix by a factor larger than 1 results in a 
coarser quantization and a lower bit-rate at the expense of higher compression error. 
The compression ratio (CR) is defined as the ratio of the bits/pixel of the original
image to that of the compressed image. For an 8-bit monochrome image, CR = 8 (i.e.,
1 bit/pixel encoding) results in visually lossless compression for most images. Severe
ringing and blocking are observed at CR = 15 or higher.

Most JPEG implementations give the user a parameter to trade image size for
image quality, such as the quality factor (QF) parameter, which scales the
quantization matrix. The range of values for QF is typically from 1 to 100, where
QF=75 corresponds to the unscaled default matrix, but this may vary from
implementation to implementation.
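As one concrete instance of such a mapping, the sketch below follows the quality-to-scale convention of the IJG libjpeg software, in which the unscaled matrix corresponds to a quality setting of 50 rather than 75. The matrix `BASE_LUMA` is the example luminance table of the JPEG standard and is assumed here to match Table 7.7; the function names are illustrative.

```python
# Example luminance quantization table of the JPEG standard (assumed to match Table 7.7).
BASE_LUMA = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def ijg_scale_percent(quality):
    """Map a quality setting in 1..100 to a percentage scale factor (IJG convention)."""
    quality = max(1, min(100, quality))
    return 5000 // quality if quality < 50 else 200 - 2 * quality


def scale_table(base, quality):
    """Scale every step size and clamp it to the 8-bit range 1..255."""
    s = ijg_scale_percent(quality)
    return [[max(1, min(255, (q * s + 50) // 100)) for q in row] for row in base]


def compression_ratio(bpp_original, bpp_coded):
    """CR as the ratio of original to coded bits/pixel."""
    return bpp_original / bpp_coded


if __name__ == "__main__":
    print(ijg_scale_percent(25), ijg_scale_percent(50), ijg_scale_percent(75))
    print(scale_table(BASE_LUMA, 75)[0])
    print(compression_ratio(8.0, 1.0))
```

A scale above 100 percent coarsens the quantization and lowers the bit-rate, while a scale below 100 percent does the opposite.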


7.3 Wavelet-Transform Coding and JPEG 2000 


Wavelet-transform coding is a multi-resolution image-coding approach that is closely 
related to earlier sub-band coding, which improves upon pyramid coding used in the 
hierarchical mode of the original JPEG. Both wavelet and sub-band coding employ 
a “complete” multi-resolution image representation, where the number of samples 
is equal to that in the original image, whereas pyramid coding uses an “overcom- 
plete” pyramid-image representation with a larger number of samples than that of 
the original image. 

The basic idea of wavelet/sub-band coding is to decompose an image into
non-overlapping frequency bands to obtain a set of low-pass, bandpass, and high-pass
sub-images. Each sub-image is then sub-sampled and encoded separately using a bit-rate
matched to its perceptual requirements. The decoder reconstructs the image by
adding upsampled and appropriately filtered versions of the decoded sub-images.

In the following, we first address the decomposition of an image into sub-images, 
and its reconstruction from these sub-images, which is called the analysis-synthesis 
problem. We also compare the filters used in sub-band and wavelet transforms. Spe- 
cific methods used for coding the individual sub-images in sub-band coding vs. the 
JPEG2000 standard are presented next. 

7.3.1 Wavelet Transform and Choice of Filters


The discrete-wavelet transform (DWT) is a multi-resolution image decomposi- 
tion, which can be considered as expansion of an image onto a set of wavelet- 
basis functions. The wavelet-basis functions, unlike the basis function for DFT 
and DCT, are well localized in both space and frequency. This expansion onto 
discrete wavelet basis functions is generally implemented by a digital filterbank 
using a pair of low-pass and high-pass filters (similar to sub-band coding). Since 
we almost always employ separable filtering, the computation of 2D-wavelet trans- 
form reduces to that of 1D transform. The principles of 1D discrete-wavelet trans- 
form and fundamentals of wavelet analysis-synthesis filter design were covered in 
Section 3.2.4. Here, we review the key points of that discussion that are relevant 
to image compression. 

In 1D binary discrete wavelet analysis, a signal $s(n)$ is split into two equal-size
frequency bands, called lower and upper frequency bands, shown in Figure 7.10.
If we let the normalized sampling frequency be equal to $f=1$, we need a low-pass
filter $H_0(f)$ with the passband $(0, 1/4)$ and a high-pass filter $H_1(f)$ with the
passband $(1/4, 1/2)$ for binary decomposition. Since we employ FIR filters, we have
to allow an overlap between the passbands of the low-pass and high-pass filters to
avoid any frequency gaps. The frequency responses of a realizable pair of low-pass
and high-pass filters that exhibit mirror symmetry about $f=1/4$ are depicted in
Figure 7.11. The outputs of these analysis filters are sub-sampled by 2 to obtain the
low-pass and high-pass sub-signals, $y_0(n)$ and $y_1(n)$, respectively. Note that, due to
sub-sampling, the total number of samples in $y_0(n)$ and $y_1(n)$ is equal to the number
of samples in $s(n)$.

After compression and decompression, the processed sub-signals $\hat{y}_0(n)$ and
$\hat{y}_1(n)$ are upsampled by zero insertion and then filtered to reconstruct the processed
signal $\hat{s}(n)$. The filters $g_0(n)$ and $g_1(n)$ are called synthesis filters. Assuming lossless
compression, i.e., $\hat{y}_0(n)=y_0(n)$ and $\hat{y}_1(n)=y_1(n)$, the filters $h_0(n)$, $h_1(n)$, $g_0(n)$, and
$g_1(n)$ can be designed such that $\hat{s}(n) = s(n)$.

The fact that the frequency response of the low-pass filter $H_0(f)$ extends into
the band $(1/4, 1/2)$ and vice versa causes unavoidable aliasing when each sub-band
signal is decimated by 2. In order to achieve alias-free analysis-synthesis filtering
(assuming lossless coding), the filters must be designed in such a way that the
aliasing introduced by the analysis filter is canceled by the synthesis filter. Hence, the
analysis-synthesis filters should satisfy the following properties:

Figure 7.10 Block diagram of sub-band decomposition and reconstruction filtering.

Figure 7.11 Frequency response of 1D binary decomposition filters.


Alias-cancellation: Perfect reconstruction requires that the filters satisfy (see
Section 3.2.4)

$$H_0(f)\,G_0(f) + H_1(f)\,G_1(f) = 2 \qquad (7.7)$$
$$H_0(f+\tfrac{1}{2})\,G_0(f) + H_1(f+\tfrac{1}{2})\,G_1(f) = 0 \qquad (7.8)$$


Symmetry: An important concern in image compression is to avoid increasing
the number of image samples while filtering. Since linear convolution increases
the number of samples, filtering is implemented by circular convolution, which
yields the same number of samples as the input. We apply symmetric boundary
extension in order to avoid introducing unnecessary high-frequency energy due
to artificial left-to-right and top-to-bottom intensity discontinuities. However,
to preserve the symmetry after filtering so the number of wavelet coefficients
to be encoded does not increase, the filter must also be symmetric. Note that
odd-length symmetric FIR filters are called zero-phase filters.

Orthogonality: Orthogonal filters implement a projection of the input image 
onto a set of orthogonal wavelet-basis functions. With proper normalization, 
orthogonal transforms preserve energy and norm, as stated by Parseval’s theo- 
rem. This property ensures that the energy of the quantization error committed 
by quantization of the transform coefficients remains unchanged in the pixel 
domain. 

It has been shown that the only filter that satisfies perfect reconstruction,
symmetry, and orthogonality is the trivial case of 2-tap Haar filters, $h_0[n]=\{1,1\}$ and
$h_1[n]=\{1,-1\}$, which does not have good energy-compaction property. Hence, in
order to design FIR filters (compactly supported wavelets) with good energy
compaction (a larger number of vanishing moments), we need to give up either
orthogonality or symmetry. In modern wavelet image compression, symmetry turns out to be
more important than orthogonality, since we can design symmetric, bi-orthogonal
(non-orthogonal) filters that are nearly orthogonal.

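Bi-orthogonal filters: The conditions (7.7) and (7.8) can be equivalently stated as

$$H_0(f)\,G_0(f) + H_0(f+\tfrac{1}{2})\,G_0(f+\tfrac{1}{2}) = 2$$
$$H_1(f)\,G_1(f) + H_1(f+\tfrac{1}{2})\,G_1(f+\tfrac{1}{2}) = 2$$
$$H_0(f)\,G_1(f) + H_0(f+\tfrac{1}{2})\,G_1(f+\tfrac{1}{2}) = 0$$
$$H_1(f)\,G_0(f) + H_1(f+\tfrac{1}{2})\,G_0(f+\tfrac{1}{2}) = 0$$

which can be expressed in the spatial domain as bi-orthogonality constraints [Vet 92]

$$\langle h_0[k],\, g_0[2n-k] \rangle = \delta[n]$$
$$\langle h_1[k],\, g_1[2n-k] \rangle = \delta[n]$$
$$\langle g_0[k],\, h_1[2n-k] \rangle = 0$$
$$\langle g_1[k],\, h_0[2n-k] \rangle = 0$$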


where $\langle \cdot,\cdot \rangle$ denotes the inner product. In wavelet terms, the filters
$\{h_0[n], h_1[n]\}$ are derived from a pair $\{\Phi_1, \Psi_1\}$, while the filters
$\{g_0[n], g_1[n]\}$ are derived from another pair $\{\Phi_2, \Psi_2\}$ that are related to
$\{\Phi_1, \Psi_1\}$ by the bi-orthogonality constraints. The bi-orthogonality conditions
provide us with more flexibility to design odd-length symmetric FIR filters. A complete
derivation of wavelet filter design is beyond the scope of this book. The interested
reader is referred to [Vai 87] and [Vet 89] for details.

There is a large family of bi-orthogonal wavelets. Among these, the (9,7) filters are
nearly orthogonal and provide good energy compaction. Hence, they were selected
to be used in the JPEG2000 standard. The coefficients of the (9,7) and (5,3) filters are
tabulated in Tables 7.8 and 7.9, respectively. The wavelet filters possess some
regularity properties that not all orthogonal QMF filters have, which give them improved
coding efficiency over orthogonal QMF filters with the same number of taps. For
example, Antonini et al. [Ant 92] report that they can achieve a performance close to
that of Woods and O'Neil [Woo 86] by using 9/7-tap filters, whereas the latter uses
32-tap Johnston filters.

The 1D decompositions can be extended to two dimensions by using separable
filters, i.e., splitting the image $s(n_1, n_2)$ first in the row and then in the column
direction, or vice versa. Using a binary decomposition in each direction, we obtain four
sub-bands called low (L) $y_L(n_1, n_2)$, horizontal (H) $y_H(n_1, n_2)$, vertical (V) $y_V(n_1, n_2)$,
and diagonal (D) $y_D(n_1, n_2)$, corresponding to lower-lower, upper-lower, lower-upper,
and upper-upper sub-bands, respectively. The decomposition can be continued by
splitting all sub-bands or just the L sub-band repetitively, as shown in Figure 7.12.
Other decompositions are also possible.

The wavelet transform coefficients correspond to pixels of respective sub-images.
In most cases, the decomposition is carried out in multiple stages. The total number
of samples in all sub-images $y_L(n_1,n_2)$, $y_H(n_1,n_2)$, $y_V(n_1,n_2)$, and $y_D(n_1,n_2)$ is the
same as the number of samples in the input image $s(n_1, n_2)$ after sub-sampling of the
sub-images. Thus, the wavelet decomposition itself does not result in data
compression or expansion. Observe that $y_L(n_1,n_2)$ corresponds to a low-resolution version of
the image $s(n_1,n_2)$, while $y_D(n_1,n_2)$ contains the high-frequency detail information.
Therefore, the wavelet decomposition is also known as a “multi-scale” or
“multi-resolution” representation, and can be used in progressive transmission.


Figure 7.12 Two-level binary-tree decomposition of an image into frequency bands.
(Input $s(n_1, n_2)$; second-level sub-bands labeled 2LL, 2LH, and 2HL.)


Sub-Band Coding vs. Wavelet Image Coding


Sub-band coding quantizes and encodes coefficients in each sub-band independently 
of other bands. It allocates bit-rates to each sub-band according to a formula based 
on the variance of the coefficients in that sub-band [Vet 84, Woo 86]. However, 
the optimal rate allocation changes as the total rate varies and the allocation and 
encoding must be redone for each rate point. Furthermore, because the rate
allocation is based on theoretical quantization models that fit higher bit-rates better, in
practice it is difficult to match the desired total target bit-rate. As a result, matching 
the target bit-rate or a pre-specified filesize involves some trial and error. Finally, the 
resulting bitstream is not embedded; i.e., bits are not arranged in decreasing order of 
importance, and truncation of the bitstream may yield unpredictable results. Some 
early wavelet image encoders employ similar strategies [Bra 94]. 

Modern wavelet coders leading to the JPEG 2000 standard employ fundamentally 
different techniques that produce embedded bit-streams [Use 01]. Recent wavelet- 
compression methods, such as embedded zero tree wavelet transform (EZW) [Sha 
93] and set partitioning in hierarchical trees (SPIHT) [Sai 96], do not use explicit 
bit-allocation formulae and have introduced two main new ideas: i) dependent cod- 
ing of coefficients across different frequency bands using zero trees and set partition- 
ing in hierarchical trees and ii) embedded bit-stream generation using successive 
approximation quantization, which both have influenced JPEG 2000. 


7.3.2 ISO JPEG2000 Standard 


Unlike EZW and SPIHT, JPEG2000 does not exploit correlation between sub- 
bands. JPEG2000, based on the embedded block coding with optimized truncation 
(EBCOT) [Tau 00], encodes each sub-band independently. It divides an image into 
tiles (non-overlapping rectangular blocks). A color transform is applied to tiles. Each 
color component of each tile is input to the block diagram shown in Figure 7.13, 
where computation of DWT is followed by uniform quantization of the DWT coef- 
ficients. The quantized coefficients are then divided into rectangular code blocks, 
which are entropy coded independently. In Tier-1 (T1) coding, the entropy encoder 
generates an embedded bitstream for each code block by coding wavelet coefficients 
bit-plane-by-bit-plane. In Tier-2 coding, a rate-allocation algorithm can be used to 
aggregate T1 bitstreams into a single interleaved output bitstream in the order of 
significance of all encoded bits. Each block is described next. Details can be found 
in [Ima 02, ISO 01]. 

Figure 7.13 Block diagram of JPEG2000 operation.


Pre-Processing and Color Transforms 


Tiling partitions the input image into rectangular, non-overlapping blocks of equal 
size (except possibly at the image borders). The tile size is arbitrary and can be as 
large as the whole image. Each tile is compressed independently with its own set of 
compression parameters. 

Next, unsigned pixel values represented by $B$ bits (in each channel) are
level-shifted by subtracting the fixed value $2^{B-1}$ to make pixel values symmetric about 0.
Level-shifted pixel values are subject to color transform. JPEG2000 defines two color 
transforms: irreversible color transform (ICT) and reversible color transform (RCT). 
ICT is the traditional RGB to YCrCb color transform, used in lossy image
compression, given by

$$Y = 0.299R + 0.587G + 0.114B$$
$$Cr = 0.713(R - Y)$$
$$Cb = 0.564(B - Y)$$


or equivalently 


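$$\begin{bmatrix} Y \\ Cr \\ Cb \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.500 & -0.419 & -0.081 \\ -0.169 & -0.331 & 0.500 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}$$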


RCT can be used in either lossy or lossless compression, and is given by

$$Y = \left\lfloor \frac{R + 2G + B}{4} \right\rfloor$$
$$U = R - G$$
$$V = B - G$$

The inverse of RCT is given by

$$G = Y - \left\lfloor \frac{U + V}{4} \right\rfloor$$
$$R = U + G$$
$$B = V + G$$

Wavelet Filters 

Part 1 of JPEG2000 includes two choices of DWT filterbank. They are the
Daubechies (9,7) floating-point filterbank, which provides superior compression
performance, and the (5,3) filterbank for use in lifted integer-to-integer (reversible)
DWT implementation. The coefficients of the linear-phase FIR filters, $h_0(n)$, $h_1(n)$,
$g_0(n)$, $g_1(n)$, are listed in Tables 7.8 and 7.9. Both filterbanks are bi-orthogonal,
which means that $h_0(n)$ and $g_1(n)$ are orthogonal and $g_0(n)$ and $h_1(n)$ are orthogonal.

The output wavelet coefficients are floating-point numbers when using the
floating-point (9,7) DWT filterbank. The lifting method to compute DWT coefficients
provides a way to compute an integer-to-integer DWT. The JPEG2000 standard defines
the conversion of the (5,3) filterbank into an integer-to-integer transform.
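A minimal sketch of a one-level, 1D integer-to-integer (5,3) transform by lifting is given below. It uses the predict and update steps commonly given for the reversible (5,3) filterbank together with whole-sample symmetric extension at the borders; the function names are illustrative and the input is assumed to contain at least two samples.

```python
def _sym(i, n):
    """Reflect index i into 0..n-1 (whole-sample symmetric extension)."""
    if i < 0:
        i = -i
    if i >= n:
        i = 2 * (n - 1) - i
    return i


def lift53_forward(x):
    """One-level reversible (5,3) DWT of an integer list; returns (low, high)."""
    n = len(x)
    y = list(x)
    for i in range(1, n, 2):    # predict step: odd samples become high-pass
        y[i] = x[i] - ((x[_sym(i - 1, n)] + x[_sym(i + 1, n)]) >> 1)
    for i in range(0, n, 2):    # update step: even samples become low-pass
        y[i] = x[i] + ((y[_sym(i - 1, n)] + y[_sym(i + 1, n)] + 2) >> 2)
    return y[0::2], y[1::2]


def lift53_inverse(low, high):
    """Undo lift53_forward by reversing the lifting steps."""
    n = len(low) + len(high)
    y = [0] * n
    y[0::2] = low
    y[1::2] = high
    x = list(y)
    for i in range(0, n, 2):    # undo update
        x[i] = y[i] - ((y[_sym(i - 1, n)] + y[_sym(i + 1, n)] + 2) >> 2)
    for i in range(1, n, 2):    # undo predict
        x[i] = y[i] + ((x[_sym(i - 1, n)] + x[_sym(i + 1, n)]) >> 1)
    return x


if __name__ == "__main__":
    s = [10, 12, 14, 200, 201, 7, 0, 255, 3]
    low, high = lift53_forward(s)
    print(low, high, lift53_inverse(low, high) == s)
```

Because each lifting step adds an integer-valued function of the other polyphase component, the inverse simply subtracts the same quantity, and reconstruction is exact regardless of the rounding used.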


Table 7.8 Daubechies (9,7) Floating-Point Filterbank

Analysis filters

n     Low-pass, h_0(n)        n        High-pass, h_1(n)
0     +0.602949018236360      -1       +1.115087052457000
±1    +0.266864118442875      -2, 0    -0.591271763114250
±2    -0.078223266528990      -3, 1    -0.057543526228500
±3    -0.016864118442875      -4, 2    +0.091271763114250
±4    +0.026748757410810

Synthesis filters

n     Low-pass, g_0(n)        n        High-pass, g_1(n)
0     +1.115087052457000      1        +0.602949018236360
±1    +0.591271763114250      0, 2     -0.266864118442875
±2    -0.057543526228500      -1, 3    -0.078223266528990
±3    -0.091271763114250      -2, 4    +0.016864118442875
                              -3, 5    +0.026748757410810

Table 7.9 (5,3) Filterbank

n     h_0(n)    g_0(n)    n        h_1(n)    n        g_1(n)
0     +3/4      +1        -1       +1        1        +3/4
±1    +1/4      +1/2      -2, 0    -1/2      0, 2     -1/4
±2    -1/8                                   -1, 3    -1/8




Filter Normalization 


DWT filters are often normalized in terms of the DC gain

$$G_{DC} = \left| \sum_n h_0(n) \right|$$

for low-pass filters and the Nyquist gain

$$G_{Nyq} = \left| \sum_n (-1)^n h_1(n) \right|$$

for high-pass filters. The (9,7) and (5,3) analysis filterbanks have been normalized
such that $G_{DC}=1$ and $G_{Nyq}=2$. This is referred to as (1,2) normalization, which
is adopted in Part 1 of JPEG 2000. Other forms that appear in the literature are
$(\sqrt{2},\sqrt{2})$ and (1,1) normalizations. Once normalization of the analysis filterbank
is specified, normalization of synthesis filterbank is automatically determined by
reversing the order and multiplying by a scalar factor c.
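The (1,2) normalization can be confirmed numerically from the tabulated analysis filters. The sketch below assumes the tap values of Tables 7.8 and 7.9, with the low-pass filters centered at $n=0$ and the high-pass filters centered at $n=-1$, and takes the magnitude of each sum; the helper names are illustrative.

```python
def symmetric_taps(center, half):
    """Build {n: tap} for a symmetric filter; half[m] is the tap at distance m from center."""
    taps = {}
    for m, c in enumerate(half):
        taps[center + m] = c
        taps[center - m] = c
    return taps


def dc_gain(h):
    """Magnitude of the sum of all taps."""
    return abs(sum(h.values()))


def nyquist_gain(h):
    """Magnitude of the alternating-sign sum of the taps."""
    return abs(sum((-c if n % 2 else c) for n, c in h.items()))


H0_97 = symmetric_taps(0, [0.602949018236360, 0.266864118442875,
                           -0.078223266528990, -0.016864118442875,
                           0.026748757410810])
H1_97 = symmetric_taps(-1, [1.115087052457000, -0.591271763114250,
                            -0.057543526228500, 0.091271763114250])
H0_53 = symmetric_taps(0, [0.75, 0.25, -0.125])
H1_53 = symmetric_taps(-1, [1.0, -0.5])

if __name__ == "__main__":
    print(dc_gain(H0_97), nyquist_gain(H1_97))
    print(dc_gain(H0_53), nyquist_gain(H1_53))
```

Both filterbanks give a DC gain of 1 for the low-pass filter and a Nyquist gain of 2 for the high-pass filter, up to floating-point rounding for the (9,7) taps.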


Boundary Extension 


In order to cope with image border effects, each line of the image is extended by the 
symmetric and periodic boundary extension scheme when using odd-tap filters as 
shown in Figure 7.14. 
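A minimal sketch of such an extension for one image line follows. It assumes whole-sample symmetry, i.e., the border samples are not repeated, so the extended signal is periodic with period $2(N-1)$ for a line of $N \geq 2$ samples; the function name is illustrative.

```python
def symmetric_extend(x, pad):
    """Extend list x by `pad` samples on each side using whole-sample symmetry."""
    n = len(x)
    period = 2 * (n - 1)            # the extended signal repeats with this period

    def src(i):
        i %= period                 # fold any index into one period
        return i if i < n else period - i

    return [x[src(i)] for i in range(-pad, n + pad)]


if __name__ == "__main__":
    print(symmetric_extend([1, 2, 3, 4], 2))
```

With this extension, filtering by an odd-length symmetric filter produces an output that is itself symmetric about the borders, so no extra coefficients need to be stored.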


Quantization at the Encoder 


Similar to the original JPEG standard, JPEG2000 employs uniform quantization 
of the wavelet coefficients, with one step-size for each sub-band. An important dif- 
ference is in the inclusion of a central deadzone. It has been shown that the R-D 
optimal quantizer for a signal with Laplacian probability density (such as DCT or 
wavelet coefficients) is a uniform quantizer with a central deadzone. The size of 
the optimal deadzone as a fraction of the step-size increases as the variance of the 
Laplacian distribution decreases. In Part 1, the size of the deadzone is taken as twice
the step-size due to its optimal embedded structure. This is shown in Figure 7.15.

Figure 7.14 Symmetric and periodic boundary extension.

Figure 7.15 Uniform quantizer with a deadzone.


Embedded quantization means that an $M_b$-bit quantizer index resulting from
step-size $\Delta_b$ is transmitted progressively starting with the most-significant bit (MSB)
and proceeding toward the least-significant bit (LSB). The resulting index after
decoding only $N_b$ bits corresponds to a quantizer with a step-size $\Delta_b 2^{M_b-N_b}$.

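The quantization at the encoder is performed according to

$$q_b(k_1,k_2) = \mathrm{sign}\big(y_b(k_1,k_2)\big) \left\lfloor \frac{|y_b(k_1,k_2)|}{\Delta_b} \right\rfloor$$

where the step-size $\Delta_b$ is represented by an 11-bit mantissa $\mu_b$ and 5-bit exponent
$\varepsilon_b$ as follows:

$$\Delta_b = 2^{R_b-\varepsilon_b} \left( 1 + \frac{\mu_b}{2^{11}} \right)$$

where $R_b$ is the number of bits used to represent the nominal dynamic range of
sub-band $b$.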
JPEG2000 allows two modes to signal the value of $\Delta_b$ to the decoder:

1. Expounded quantization: One $(\varepsilon_b, \mu_b)$ value for every sub-band is explicitly
transmitted.

2. Derived quantization: A single $(\varepsilon_0, \mu_0)$ value is sent for the LL sub-band. Values
for other sub-bands $(\varepsilon_b, \mu_b)$ are derived by scaling the $\Delta_0$ value

$$(\varepsilon_b, \mu_b) = (\varepsilon_0 - N_L + n_b,\ \mu_0)$$

where $N_L$ is the total number of decomposition levels and $n_b$ is the
decomposition level for sub-band $b$.
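The step-size representation and the derived mode can be illustrated with a small sketch. The function and parameter names are illustrative, and the numerical values in the usage lines are arbitrary examples rather than values prescribed by the standard.

```python
def step_size(r_b, eps_b, mu_b):
    """Step size from dynamic range r_b, 5-bit exponent eps_b, 11-bit mantissa mu_b."""
    return 2.0 ** (r_b - eps_b) * (1.0 + mu_b / 2048.0)


def derived_parameters(eps_0, mu_0, n_levels, n_b):
    """Derived mode: shift the LL exponent by the sub-band level, keep the mantissa."""
    return eps_0 - n_levels + n_b, mu_0


if __name__ == "__main__":
    print(step_size(10, 8, 1024))                # 2**2 * 1.5
    eps_b, mu_b = derived_parameters(13, 100, 5, 3)
    print(eps_b, mu_b, step_size(10, eps_b, mu_b))
```

In the derived mode only one exponent and mantissa pair is signaled, so the step sizes of all other sub-bands differ from that of the LL sub-band by powers of two.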


Inverse Quantization at the Decoder


The inverse quantization is performed by biased sample reconstruction (instead of
the usual mid-point reconstruction) defined by

$$Rq_b(k_1,k_2) = \begin{cases} \big(q_b(k_1,k_2) + \gamma\big)\Delta_b & \text{if } q_b(k_1,k_2) > 0 \\ \big(q_b(k_1,k_2) - \gamma\big)\Delta_b & \text{if } q_b(k_1,k_2) < 0 \\ 0 & \text{otherwise} \end{cases}$$

where $0 \le \gamma < 1$ is a bias parameter. $\gamma = 0.5$ corresponds to mid-point
reconstruction. A value of $\gamma < 0.5$ creates a bias towards zero, which results in improved
reconstruction PSNR when the probability distribution of wavelet coefficients falls off rapidly
away from zero. A popular choice is $\gamma = 0.375$.
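The deadzone quantizer, the biased reconstruction, and the embedded property can be sketched together. The function names are illustrative; `truncate_index` models a decoder that has received all but the last few LSBs of each quantizer index.

```python
import math


def quantize(y, delta):
    """Uniform deadzone quantizer: sign(y) * floor(|y| / delta)."""
    q = int(math.floor(abs(y) / delta))
    return q if y >= 0 else -q


def dequantize(q, delta, gamma=0.5):
    """Biased reconstruction; gamma = 0.5 gives the mid-point of the interval."""
    if q > 0:
        return (q + gamma) * delta
    if q < 0:
        return (q - gamma) * delta
    return 0.0


def truncate_index(q, dropped_bits):
    """Index seen by a decoder that has not received the last dropped_bits LSBs."""
    mag = abs(q) >> dropped_bits
    return mag if q >= 0 else -mag


if __name__ == "__main__":
    delta = 0.5
    for y in (13.7, -0.4, 100.0, -57.3):
        q = quantize(y, delta)
        print(y, q, dequantize(q, delta, 0.375),
              truncate_index(q, 2), quantize(y, delta * 4))
```

All inputs with magnitude below the step-size map to index 0, so the zero bin is twice as wide as the others, and dropping $p$ LSBs of an index gives the same result as quantizing with a step-size $2^p$ times larger.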


Entropy Coding 


JPEG2000 constructs an embedded bitstream by bit-plane encoding of quantizer
indices. Bit-plane coding of wavelet coefficients, which is illustrated in Figure 7.16,
has also been used by other embedded wavelet coders such as EZW and SPIHT.
JPEG 2000 uses a block-coding paradigm in the wavelet domain, where each
sub-band is partitioned into small rectangular blocks, called code blocks, that are also
coded independently. While this results in a small loss of compression efficiency, it
has several other benefits such as improved error resilience, flexible arrangement of
progression orders, localized random access, and improved cropping.

During progressive encoding of bit-planes, a quantized wavelet coefficient is
called insignificant if all of its quantizer index bits coded so far are zero. The
coefficient becomes significant when its first non-zero bit is encountered, at which point
the sign of the coefficient is also encoded. Once a coefficient becomes significant, all
subsequent bits are referred to as refinement bits. All bits are encoded using
context-based adaptive binary arithmetic coding to form the Tier-1 bitstream.
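The classification of bits into significance, sign, and refinement information can be sketched as follows. This is a simplified illustration only: it scans every coefficient once per bit-plane and omits the context modeling, the multiple coding passes per bit-plane, and the arithmetic coder of the actual Tier-1 encoder. The function name and the event tuples are illustrative.

```python
def bitplane_events(indices, num_planes):
    """List (plane, position, kind, bit) events from the MSB plane down to the LSB plane."""
    significant = [False] * len(indices)
    events = []
    for p in range(num_planes - 1, -1, -1):
        for i, q in enumerate(indices):
            bit = (abs(q) >> p) & 1
            if significant[i]:
                events.append((p, i, "refinement", bit))
            else:
                events.append((p, i, "significance", bit))
                if bit:
                    # The first non-zero bit makes the coefficient significant,
                    # and its sign is coded immediately afterwards.
                    significant[i] = True
                    events.append((p, i, "sign", 1 if q < 0 else 0))
    return events


if __name__ == "__main__":
    for event in bitplane_events([5, -2, 0], 3):
        print(event)
```

Truncating the event list after any bit-plane leaves every coefficient known to the precision of the planes received so far, which is what makes the bitstream embedded.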





Figure 7.16 Progressive encoding of bit-planes. 


Region-of-Interest Coding


Region-of-interest (ROI) coding has two functionalities: i) to compress the ROI 
with higher fidelity when the full bitstream is decoded, and ii) to transmit the ROI 
before the background region in case the bitstream is partially decoded. Both func- 
tionalities are achieved by the Maxshift method. The basic principle of the Maxshift 
method is to scale up (shift up) the wavelet coefficients in the ROI. In particular, 
the bits associated with the ROI are shifted up such that all the bits of wavelet coef- 
ficients in the ROI are in higher bit-planes than the most significant bit of all other 
pixels within that tile. During the embedded coding process, coded bits for higher 
bit-planes are placed prior to the background bits in the final bitstream. No shape 
information for the ROI needs to be included in the bitstream. 
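The scaling principle can be sketched on a list of quantizer indices with an ROI mask. The function names are illustrative; the shift value is assumed to be conveyed to the decoder, which then identifies ROI coefficients from their magnitude alone.

```python
def maxshift_encode(coeffs, roi_mask):
    """Shift ROI magnitudes above the highest bit-plane used by the background."""
    background_max = max((abs(c) for c, r in zip(coeffs, roi_mask) if not r),
                         default=0)
    s = background_max.bit_length()     # number of background bit-planes
    shifted = []
    for c, r in zip(coeffs, roi_mask):
        if r:
            mag = abs(c) << s
            shifted.append(-mag if c < 0 else mag)
        else:
            shifted.append(c)
    return shifted, s


def maxshift_decode(shifted, s):
    """Any magnitude of at least 2**s must belong to the ROI; shift it back down."""
    out = []
    for c in shifted:
        if abs(c) >= (1 << s):
            mag = abs(c) >> s
            out.append(-mag if c < 0 else mag)
        else:
            out.append(c)
    return out


if __name__ == "__main__":
    coeffs = [3, -7, 12, -1, 0, 5]
    mask = [False, True, False, True, True, False]
    shifted, s = maxshift_encode(coeffs, mask)
    print(shifted, s, maxshift_decode(shifted, s))
```

Because every non-zero ROI magnitude exceeds every background magnitude after the shift, bit-plane coding sends the ROI bits first, and the decoder needs no mask to undo the scaling.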


Error Resilience 


Error resilience tools include compressed data partitioning, resynchronization, and 
error detection. Error-resilient bitstream syntax and tools are provided both at the 
entropy coding and packetization levels. The independent code-block coding approach 
employed in JPEG2000 leads to improved error resilience. Finally, JPEG2000 speci- 
fies an optional file format called JP2. 


Exercises


7.1 Suppose we have a DMS with the alphabet $A$ and the symbol probabilities
$p(a_i)$, $a_i \in A$ specified by the following table:





a. Find the entropy of this source. 
b. Design a Huffman code for this source. 
c. Find the average codeword length.
d. How good is this code? 
e. Can Huffman coding result in data expansion? Explain why or why not. 


7.2 Suppose we have a binary source with symbol probabilities P(0)=0.75 and
P(1)=0.25.

a. Design an arithmetic code for the following sequence of symbols: 
0000110010 


b. What is the entropy of the source? 
c. What is the average codeword length for this instance of the source? 
Explain. 


7.3 Explain why the ITU Group 3 and Group 4 codes may result in data
expansion with half-tone images. How does the JBIG standard address this problem?


7.4 What is the primary motivation of using the Gray codes in bit-plane encoding?


7.5 Suppose a random variable $X$ that is uniformly distributed between 0 and 10 is
uniformly quantized with $N$ levels.

a. Find all decision and reconstruction levels. 

b. Calculate the mean-square quantization error. 


7.6 Suppose we have three scalar random variables, $X_1$, $X_2$, and $X_3$ with variances
2.5, 7, and 10. Propose a method to allocate a total of 20 bits to quantize
each random variable independently to minimize the mean of the sum of the
square-quantization errors.


7.7 Let $X_n$, $n = \ldots, -1, 0, 1, \ldots$ denote a sequence of zero-mean scalar random
variables with

$$E\{X_n^2\} = \sigma^2 \quad \text{and} \quad E\{X_n X_{n-1}\} = a$$

We define the prediction-error sequence

$$\varepsilon_n = X_n - p X_{n-1}$$

where $p$ is such that

$$E\{(X_n - pX_{n-1})X_{n-1}\} = 0 \quad \text{for all } n$$

a. Find the value of $p$.
b. Let $D_1$ and $D_2$ denote the distortion incurred by uniform quantization of
$X_n$ and $\varepsilon_n$, respectively, using $r$ bits. Show that $D_1$ is always greater than or
equal to $D_2$.


7.8 Compute the Karhunen-Loève transform, DFT, and DCT of the block
shown below.





Compare these transforms on the basis of energy compaction (decorrelation) 
and similarity of the basis functions. 


7.9 Explain why zigzag scanning is used in DCT compression.


7.10 How do you achieve bit-rate (bits/pixel) vs. quality (PSNR) tradeoff in JPEG
image coding?


7.11 How does the JPEG algorithm take advantage of spectral (color) and
perceptual (psychovisual) redundancy?


7.12 Elaborate on the relationship between the subband coding and the DCT
coding. In particular, consider a 64×64 image. Suppose we partition the image
into 8×8 blocks and compute their DCT.

a. Show that we can construct 64 subband images, each of which is 8×8, by
an ordering of the DCT coefficients.

b. Discuss the relationship between DPCM encoding of the DC coefficients
in JPEG and DPCM encoding of the lowest subband in this case.


7.13 Show that a perfect reconstruction filterbank cannot satisfy the following
requirements simultaneously: i) it is FIR symmetric (zero-phase) with filter
length greater than 2 and ii) analysis and synthesis filters are orthogonal.


Internet Resources

J. Abel, The Data Compression Resource on the Internet
http://www.data-compression.info

M. Dipperstein, Lempel-Ziv-Welch Encoding
http://michael.dipperstein.com/lzw/

Markus Kuhn, JBIG-KIT (JBIG1)
http://www.cl.cam.ac.uk/~mgk25/jbigkit/

JBIG2 Encoder and Decoder
http://www.ghostscript.com/jbig2dec.html
https://github.com/agl/jbig2enc

CharLS, a JPEG-LS library
http://charls.codeplex.com/

OpenJPEG Homepage - JPEG2000 codec
http://www.openjpeg.org/

Image and Video Compression Learning Tool VCDemo, TU Delft, 2011
http://insy.ewi.tudelft.nl/content/image-and-video-compression-learning-tool-vcdemo


CHAPTER 8


Video Compression 





Video compression is a critical technology that has emerged over the past four
decades and has enabled complete transition from analog to digital video in
all entertainment and communication industries, including digital TV, digital
cinema, and visual communications. It has revolutionized how we consume and
communicate visual media, making the Internet the premier media and visual
information-exchange environment.


The high bit-rate requirements for uncompressed standard and high-definition video 
formats make the need for video compression evident. Different industries and appli- 
cations may have different video resolution and quality requirements, ranging from 
lossless compression for studio editing and archiving to lossy compression at various 
target bit-rate and quality points depending on the available display size and resolu- 
tion as well as transmission bit-rate. In order to respond to these varying require- 
ments, a number of compression algorithms and standards have been developed. 

The simplest approach to video compression would be to employ a still-frame
compression technique, such as JPEG or JPEG 2000, on a frame-by-frame basis
to provide random access to any frame. However, the compression ratio that can
be achieved by this approach is limited because inter-frame (temporal)
redundancies are ignored. Inter-frame-compression methods, which take advantage of
temporal redundancy by exploiting similarity between neighboring frames, e.g., through
motion-compensated (MC) coding, provide superior compression efficiency. This
chapter first presents different video-compression approaches, including a brief
discussion on how to best exploit the temporal redundancy, and then introduces
various international standards that employ the MC-transform-coding strategy.


8.1 Video-Compression Approaches 


A variety of approaches, ranging from MC transform coding and 3D-transform cod- 
ing, including discrete cosine transform (DCT) and wavelet transform, to vector 
quantization, fractal coding, and 2D and 3D model-based coding, have been inves- 
tigated for video compression in the literature. A comprehensive review of model- 
based coding techniques can be found in [Aiz 95]. While some of these techniques 
have been successful for specific types of video at very low bit-rates, we only discuss 
the mainstream approaches of intra-only coding, 3D-transform coding, and block- 
motion compensated coding. Indeed, modern video-coding applications and stan- 
dards employ only intra-coding and MC transform coding. 


8.1.1 Intra-Frame Compression, Motion JPEG 2000, and 
Digital Cinema 

Intra-frame compression means each picture is encoded independently using the 
image-compression techniques discussed in Chapter 7; i.e., compressed pictures do 
not depend on any data from the preceding or succeeding pictures. Intra-only
compression has some advantages, such as:


• It enables random access to each frame, and 
• It has low complexity (no multiple frame stores and motion estimation). 


However, the compression efficiency of intra-only methods is limited because inter- 
frame (temporal) redundancies cannot be exploited. Nevertheless, intra-only coding 
has found applications in consumer/professional media and digital cinema capture, 
studio editing, and archiving. 

The consumer digital video format DV and its variants DVCPRO and DVCAM 
used in camcorders as well as DVCPRO-50 and DVCPRO HD used in professional 
television production employ intra-only video coding at 25 Mbits/s to 50 Mbits/s 
data streams. They all use DCT-based JPEG-like intra-compression. The DV algo- 
rithm yields higher quality than the JPEG scheme at the nominal 5:1 compression 
ratio by allowing for optimization of quantizing tables within a frame. It also uses 
adaptive inter-field compression; i.e., it compresses the two fields of an interlaced 
frame together if little or no motion is detected to allow for higher overall quality. 

Motion JPEG 2000 (MJ2K) refers to using the JPEG 2000 still image- 
compression method for a motion picture or video sequence in a frame-by-frame 
manner. In addition to achieving better compression efficiency than the DCT- 
based JPEG, JPEG 2000 provides the ability to support lossless, near lossless, and 
lossy encoding in a single embedded bit-stream, which makes MJ2K attractive for 
the motion-picture industry to use in the entire workflow, including digital pro- 
duction, post-production, projection, and archiving. The compliance point 3 of 
MJ2K covers production and projection formats, with image sizes up to 4096 × 
3112, 4:4:4 color with up to 16 bits/color, and up to five transform layers. The 
compression efficiency of MJ2K was significantly favorable compared to using 
an intra-only AVC/H.264 video codec for high- and ultra-high-definition video 
[Mar 03]. 


8.1.2 3D-Transform Coding 


3D-transform coding refers to straightforward extension of both 2D-DCT and 
wavelet-based coding approaches to video encoding using spatio-temporal (3D) 
transforms. 


3D-DCT Coding 


In 3D-DCT coding, video is divided into M × N × J blocks, as depicted in Figure 
8.1, where M, N, and J denote the horizontal, vertical, and temporal dimensions of 
the block, respectively. The transform coefficients are quantized and encoded similar 
to 2D-transform coding of still-frame images. Since the DCT coefficients are related 





Figure 8.1 3D-transform block in 3D-transform coding. 
to the frequency content of blocks, the energy for most blocks will be packed in the 
low-spatial and temporal-frequency zone, and for temporally stationary blocks, the 
temporal DCT coefficients will be close to zero and will be truncated. The 3D-DCT 
coding does not require a separate motion-estimation step. However, it requires J 
frame stores both in the encoder and decoder. Therefore, J is typically chosen as 2 or 
4 to allow for inexpensive hardware implementations. Note that random access to 
video is possible once for every Jth frame. 

The hybrid-transform/DPCM coding strategy implements differential pulse- 
code modulation (DPCM) of 2D-transform coefficients in the temporal direction to 
overcome the multiple frame-store requirement of 3D-transform coding [Roe 77]. 
It employs a 2D-DCT on each spatial block within a given frame. Then, a bank of 
DPCM coders, each tuned to the statistics of a specific DCT coefficient, is applied 
to the transform coefficients in the temporal direction. That is, differences in the 
respective DCT coefficients in the temporal direction are quantized and encoded. 
Results comparable to that of 3D-DCT coding can be obtained with proper adapta- 
tion of the DPCM quantizers to the temporal statistics of the 2D-DCT coefficients 
[Roe 77]. Note that neither 3D-DCT coding nor hybrid DCT/DPCM coding has 
been widely used in practical applications. 


3D-Wavelet/Sub-Band Coding 


The development of 3D-wavelet/sub-band coding has been encouraged by the 
fact that: i) it often does not cause blocking artifacts, which is a common problem 
with 3D-DCT and MC-DCT coding especially at low bit-rates, ii) unlike MC- 
compression methods, it does not require a motion-estimation stage, and iii) it 
is inherently scalable, both spatially and temporally. Scalability, which refers to 
accessing digital video at various spatial and temporal resolutions without having 
to decompress the entire bit-stream, has become an important property due to the 
growing need for storage and transmission of digital video in various definition 
standards. 

In 3D-wavelet/sub-band coding, video is decomposed into a number of properly 
sub-sampled component signals, ranging from a low spatial and temporal-resolution 
component to various higher-frequency detail components. These component video 
signals are encoded using algorithms adapted to the statistical and psychovisual
properties of the respective spatio-temporal frequency bands. Compression is achieved by 
quantization of the various components and entropy coding of the quantized values. 
Higher-resolution video, in both spatial and temporal coordinates, can be recovered 
by combining the decompressed low-resolution and detail components. 

Figure 8.2 3D-wavelet decomposition with 11 bands: (a) system-block diagram and (b) spatial 
and temporal bands, where LP_t and HP_t denote LP and HP temporal bands, respectively. 


Most 3D-wavelet/sub-band decompositions use 2- or 4-frame temporal blocks 
at a time due to practical implementation considerations. Typically, the temporal 
decomposition is based on a simple 2-tap Haar filterbank [Luo 94], which gives 
the average and difference of two frames for the low-pass (LP) and high-pass (HP) 
temporal bands, respectively. In the second stage, both LP and HP temporal sub- 
bands are decomposed into LP and HP horizontal sub-bands, respectively. In the 
next stage, each of these bands is decomposed into LP and HP vertical sub-bands. 
Subsequently, the LP temporal-LP horizontal-LP vertical band is further
decomposed into four spatial sub-bands to yield an 11-band decomposition, as depicted in 
Figure 8.2(a). Spatial decomposition of the LP and HP temporal bands is depicted 
in Figure 8.2(b). Longer wavelet filters can be applied for spatial (horizontal and
vertical) decompositions, since these filters can be operated in parallel and do not affect 
frame-store requirements. Various approaches to compressing individual component 
signals are discussed in [Vet 92, Bos 92, Luo 94, Cho 99]. 
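
The 2-tap Haar temporal step described above (average and difference of two frames) can be sketched in a few lines of Python. This is only an illustration, not the filterbank of [Luo 94]: the function names, the unnormalized average/difference form, and the list-of-lists frame representation are assumptions made for the example.

```python
def haar_temporal_analysis(frame_a, frame_b):
    """Split two frames into LP (average) and HP (difference) temporal bands."""
    lp = [[(a + b) / 2 for a, b in zip(row_a, row_b)]
          for row_a, row_b in zip(frame_a, frame_b)]
    hp = [[a - b for a, b in zip(row_a, row_b)]
          for row_a, row_b in zip(frame_a, frame_b)]
    return lp, hp


def haar_temporal_synthesis(lp, hp):
    """Recover the two frames from the LP and HP temporal bands."""
    frame_a = [[l + h / 2 for l, h in zip(row_l, row_h)]
               for row_l, row_h in zip(lp, hp)]
    frame_b = [[l - h / 2 for l, h in zip(row_l, row_h)]
               for row_l, row_h in zip(lp, hp)]
    return frame_a, frame_b


# Two tiny 2 x 2 "frames"; only the right-hand column changes over time.
lp, hp = haar_temporal_analysis([[10, 20], [30, 40]], [[10, 22], [30, 36]])
```

For temporally stationary pixels the HP band is exactly zero, which is why most of the bits end up in the LP temporal band.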


8.1.3 Motion-Compensated Transform Coding 


The main idea in MC video compression is to encode the “new” information that 
is not present in the previously encoded frames. The earliest such approach was the 
so-called conditional replenishment (CR) technique [Has 72], which segments each 
frame into “changed” and “unchanged” regions with respect to the previous frame, 
and encodes only addresses and intensities of pixels in the changed regions. Since the 
amount of changed information varies from frame to frame, the information to be 
transmitted needs to be buffered, and quantization should be regulated according 
to buffer fullness. Note that conditional replenishment is a motion-detection based 
algorithm rather than an MC algorithm, since it does not require explicit estima- 
tion of the motion vectors. It is the basis of the “skip” mode in the modern video- 
compression standards. 
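
A minimal sketch of conditional replenishment, assuming a simple per-pixel threshold test for the changed/unchanged segmentation. The threshold value, the function names, and the (address, intensity) tuple layout are illustrative assumptions and are not taken from [Has 72].

```python
def conditional_replenishment(prev_frame, cur_frame, threshold=4):
    """Return (address, intensity) pairs for pixels classified as changed.

    A pixel is 'changed' when its absolute difference from the previous
    frame exceeds `threshold` (a hypothetical tuning parameter).
    """
    updates = []
    for y, (prev_row, cur_row) in enumerate(zip(prev_frame, cur_frame)):
        for x, (p, c) in enumerate(zip(prev_row, cur_row)):
            if abs(c - p) > threshold:
                updates.append(((y, x), c))
    return updates


def replenish(prev_frame, updates):
    """Decoder side: copy the previous frame and overwrite the changed pixels."""
    frame = [row[:] for row in prev_frame]
    for (y, x), value in updates:
        frame[y][x] = value
    return frame


prev = [[10, 10, 10], [10, 10, 10]]
cur = [[10, 12, 10], [10, 50, 60]]
updates = conditional_replenishment(prev, cur)
```

Note that no motion vector appears anywhere: the scheme only detects change, and the number of updates (hence the bit count) varies from frame to frame, which is why buffering is needed.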

MC techniques characterize the temporal correlation in a video by using motion 
vectors rather than through the respective transform coefficients. The earliest MC 
compression scheme was the MC-DPCM, which extends the CR method to encode 
the displaced frame difference for pixels in the changed region with respect to the 
previous frame. It yields better compression than CR provided that displacement 
vectors can be accurately estimated. MC-DPCM uses pixel (pel) recursive algorithms 
for motion estimation. 

The most popular MC-coding approach is MC-transform coding, where the tem- 
poral-prediction error after MC (displaced-block difference) is 2D-DCT encoded. 
The temporal prediction aims at minimizing the temporal redundancy, while the 
DCT encoding makes use of the spatial redundancy in the prediction error. MC- 
transform coding algorithms feature several modes to incorporate both progressive 
and interlaced inputs. These include intra-field, intra-frame, and inter-field and 
inter-frame prediction with or without motion compensation. In the inter-field and 
inter-frame modes the prediction is based on the previous field or frame, respectively. 
The basic MC-transform coding scheme employs block-based motion estimation 
and compensation. It has been argued that block-motion models are not realistic 
for most image sequences, since moving objects hardly ever manifest themselves as 
rectangular image blocks. Recently, variable block-size motion-compensation with 
more than a single motion vector per macroblock has been shown effective with- 
out increasing motion vector overhead. Almost all international standards for video 
compression employ the MC coding strategy. 


8.2 Early Video-Compression Standards 


8.2.1 ISO and ITU Standards 


There are two major international organizations that develop video-compression 
standards: International Telecommunication Union (ITU-T) (formerly CCITT) 
and International Organization for Standardization (ISO) in collaboration with the
International Electrotechnical Commission (IEC). The ITU-T Video Coding Experts Group 
(VCEG) used to deal with bi-directional real-time video-communications standards, 
and ISO/IEC Moving Picture Experts Group (MPEG) used to deal with video com- 
pression from the viewpoint of information technology. Considering the significant 
overlap and similarity between these standards, the two organizations recently joined 
forces to develop joint standards. Table 8.1 presents international video-compression 
standards in chronological order of their development. MC-transform coding is the 
basis of all video-compression standards, which are summarized in Table 8.1. 
Historically, the first video-compression standard based on MC-transform cod- 
ing was the ITU-T Recommendation (Rec.) H.261 in order to enable videophone 
and videoconferencing services over the integrated services digital network (ISDN) 
at p × 64 kbps, p = 1, ..., 30 [Che 93]. Rec. H.261 emerged as a result of studies 
performed within the European COST (CoOperation in the field of Scientific and 
Technical research) Action 211 during 1983-1990. In 1985, COST 211bis
developed a codec operating at bit rates of n × 384 kbps, n = 1, ..., 5, which was adopted 
by ITU-T as Rec. H.120 in 1987. Later, it became clear that a single standard can 
cover all ISDN rates, p × 64 kbps, p = 1, ..., 30. In addition to forming the basis 
for later video-compression standards such as MPEG-1 and MPEG-2, Rec. H.261 
offers some important features: i) Unlike JPEG, it does not define specific encoders 
to produce valid bit-streams; instead, flexibility is allowed in designing conformant 
encoders. ii) It introduces the MC block DCT architecture, which is amenable to 
low-cost hardware implementations. iii) It is limited by a maximum coding delay of 
150 msec, since it is intended for bi-directional video communication. It has been 
observed that delays exceeding 150 msec do not give the viewer the impression of direct 
visual feedback. Although Rec. H.261 is obsolete by now, it influenced future
video-coding standards, which we discuss in detail below. 


Table 8.1 International Standards for Video Compression 

Standard            Approved By       First Edition   Short Description 
H.261               ITU-T             1988            Obsolete 
MPEG-1              ISO/IEC           1993            Obsolete 
MPEG-2/H.262        ISO/IEC, ITU-T    1996            Digital broadcast 
H.263               ITU-T             1996            Obsolete 
MPEG-4 AVC/H.264    ISO/IEC, ITU-T    2003            All purpose 
MPEG HEVC/H.265     ISO/IEC, ITU-T    2013            All purpose 


8.2.2 MPEG-1 Standard 


MPEG-1 is an ISO/IEC standard that was developed for storage of digital video and 
its associated audio at about 1.5 Mbps on various digital-storage media such as CD- 
ROM, digital audio tape (DAT), and optical drives. The main parts of the MPEG-1 
standard are Systems (ISO/IEC 11172-1), Video (ISO/IEC 11172-2), and Audio 
(ISO/IEC 11172-3). Here, we concentrate on MPEG-1 video. MPEG became active 
in 1988. The definition of the video algorithm (committee draft) was completed in 
1990. MPEG-1 was approved as an international standard in early 1993. MPEG-1 is 
a generic standard in that it standardizes the syntax for representation of an encoded 
bit-stream and a method of decoding, including motion-compensated prediction 
(MCP), discrete-cosine transformation (DCT), quantization, and variable-length 
coding (VLC). MPEG-1 does not standardize a motion-estimation algorithm or a 
method for selecting the compression mode. The parameters defining the coded bit- 
stream and decoders are contained in the bit-stream itself. This allows it to be used 
with pictures of various sizes and aspect ratios at a range of bit-rates. The quality of 
MPEG-1 compressed/decompressed video at about 1.2 Mbps (video rate) was found 
to be similar (or superior) to that of VHS-recorded analog video [Gal 91, Chi 95]. 
The MPEG-1 video compression is similar to that of H.261 with the following 
new features: i) MPEG-1 offers random-access points to stored video sequences by 
introducing I-frames, which consist of only intra-coded macroblocks. In addition, 
MPEG-1 supports trick modes (fast-forward and fast-reverse functions) for digi- 
tally stored video. ii) Two other frame types have also been introduced: P-pictures 
that are coded in reference to a previous I- or P-picture and B-pictures that are 
coded by bi-directional MC using a previous and a future reference picture (both 
possibly multiple frames away from the current picture). In contrast, H.261 does 
not support bi-directional MC and allows MC only over the immediately previ- 
ous frame. iii) MPEG-1 supports half-pixel MC and no loop filter. iv) MPEG-1 
features visually weighted quantization of DCT coefficients. v) MPEG-1 has a 
flexible slice structure. vi) MPEG-1 allows for separate VLC tables for intra- and 
inter- (P or B) coded macroblocks. vii) There are no picture size or bit-rate restric- 
tions except for the constrained parameters. MPEG-1 also supports flexible frame 
rates. viii) MPEG-1 offers a reasonable coding/decoding delay of about 1 sec to 
provide unidirectional interactive access to video. The coding delay in H.261 was 
strictly limited to 150 msec to maintain bi-directional interactivity. We elaborate 
on some of these features in the following. 


Input-Video Format and Data Structure 


MPEG-1 allows only progressive (non-interlaced) video as input. Therefore, 525/30 
and 625/25 interlaced video must be converted into standard input format (SIF), 
which is either 352 × 240 at 30 frames/s or 352 × 288 at 25 frames/s (both progressive) 
for 525/30 and 625/25 analog video, respectively. The smaller spatial dimensions are 
required in order to reach the target bit-rate of 1.5 Mbps with acceptable video qual- 
ity. The (Y,Cr,Cb) color space was adopted, as in ITU-R (CCIR) Recommendation 
601. In the MPEG-1 SIF, the chroma components are sub-sampled by 2 in both the 
horizontal and vertical directions (4:2:0 format). 
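
As a rough worked calculation of why the smaller SIF dimensions matter, the sketch below computes the uncompressed bit-rate of 4:2:0 SIF video. It assumes 8 bits per sample and uses a video rate of about 1.2 Mbps as the target; the function name is illustrative.

```python
def raw_bitrate_420(width, height, fps, bits_per_sample=8):
    """Uncompressed bit-rate in bits/s of 4:2:0 video."""
    luma = width * height
    # Cr and Cb are each sub-sampled by 2 horizontally and vertically.
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * bits_per_sample * fps


sif_525 = raw_bitrate_420(352, 240, 30)   # 30,412,800 bits/s
sif_625 = raw_bitrate_420(352, 288, 25)   # 30,412,800 bits/s
ratio = sif_525 / 1.2e6                   # about 25:1 at a 1.2 Mbps video rate
```

Both SIF variants carry the same raw rate of roughly 30.4 Mbps, so a compression ratio of about 25:1 is needed under these assumptions.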

The MPEG-1 bit-stream follows a hierarchical data structure, consisting of the 
following six layers, which enables the decoder to interpret the data unambiguously: 


1. Sequences are formed by several groups of pictures. 

2. Groups of pictures (GOP) are made up of pictures. A GOP of size N contains N 
pictures. 

3. Pictures consist of slices. There are four picture types indicating the respec- 
tive modes of compression: I-pictures, P-pictures, B-pictures, and D-pictures. 
I-pictures consist of only intra-frame DCT-encoded macroblocks. They serve 
as random-access points to the video. There are two types of inter-frame- 
encoded pictures: P- and B-pictures. These pictures contain MC predictive- 
encoded macroblocks. Only forward prediction is allowed in the P-pictures, 
which are always encoded relative to the preceding I- or P-pictures. The predic- 
tion of B-pictures can be forward, backward, or bi-directional relative to other 
I- or P-pictures. D-pictures contain only the DC component of each block 
and facilitate video browsing at very low bit-rates. The number of I-, P-, and 
B-pictures in a GOP are application-dependent, e.g., dependent on access-time 
and bit-rate requirements. The number of B-pictures between successive
reference (anchor) pictures is denoted by M − 1, where M is the distance between 
the anchor pictures (known as prediction distance). 

4. Slices are made up of macroblocks. They are introduced for error recovery. 

5. A macroblock (MB) consists of four 8 × 8 luminance blocks and the spatially 
associated chroma blocks similar to JPEG. Some compression parameters can 
be varied on an MB by MB basis. The MB types are listed in Table 8.2. We will 
take a closer look at each of these MB types in the following subsections. 

6. Blocks are 8 × 8 pixel arrays. They are the smallest DCT unit. Headers are 
defined for sequences, GOPs, pictures, slices, and MBs to uniquely specify the 
data that follows. For an extensive discussion of the MPEG-1 standard, the 
reader is referred to [Gal 91, Pen 93]. 


Table 8.2 Macroblock Types in MPEG-1 

I-pictures    P-pictures    B-pictures 
Intra         Intra         Intra 
Intra-A       Intra-A       Intra-A 
              Inter-D       Inter-F 
              Inter-DA      Inter-FD 
              Inter-F       Inter-FDA 
              Inter-FD      Inter-B 
              Inter-FDA     Inter-BD 
              Skip          Inter-BDA 
                            Inter-I 
                            Inter-ID 
                            Inter-IDA 
                            Skip 


With the introduction of B-pictures, the encoder/decoder must have sufficient 
memory to store at least two decoded pictures, and the “encoding/decoding order” 
of pictures will be different from their sequential “display order.” The composition 
of a GOP and the concepts of “display order” vs. “encoding/decoding order” are 
illustrated by the following example. 


Example: Display Order vs. Encoding Order of Pictures in a GOP 

A GOP of size N = 9 and M = 4 is shown in Figure 8.3. The first frame of 
each GOP is always an I-picture. The pictures are encoded in the order 

0, 4, 1, 2, 3, 8, 5, 6, 7 

since the prediction for P- and B-pictures should be based on pictures that 
are already transmitted. Picture 4 is encoded in reference to picture 0, picture 
1 is encoded in reference to picture 0 and picture 4, etc. 
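
The reordering in this example can be reproduced with a short Python sketch. It assumes a GOP that starts with the I-picture, places an anchor every M pictures, and ends on an anchor (so N − 1 is a multiple of M); the function names are illustrative and other GOP layouts are not handled.

```python
def picture_types(n, m):
    """Display-order picture types for a GOP of n pictures, anchor distance m."""
    return "".join("I" if k == 0 else "P" if k % m == 0 else "B"
                   for k in range(n))


def gop_encoding_order(n, m):
    """Encoding order: each anchor precedes the B-pictures that depend on it."""
    if m < 1 or (n - 1) % m != 0:
        raise ValueError("this sketch requires (n - 1) to be a multiple of m")
    order = [0]                                      # the I-picture comes first
    for anchor in range(m, n, m):
        order.append(anchor)                         # next anchor (P-picture)
        order.extend(range(anchor - m + 1, anchor))  # B-pictures displayed before it
    return order


print(picture_types(9, 4))        # IBBBPBBBP
print(gop_encoding_order(9, 4))   # [0, 4, 1, 2, 3, 8, 5, 6, 7]
```
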

Figure 8.3 Group of pictures in MPEG-1. 


Intra-Frame Compression Modes 


In intra-coded MBs, pixel-intensity values are DCT encoded in a manner similar to 
JPEG or the intra-mode of H.261. Compression is achieved by quantization of the 
DCT coefficients and VLC of the resulting coefficients. 

For 8-bit images, the DC coefficient can take values in the range [0,2047], and 
the AC coefficients are in the range [-1024,1023]. These coefficients are quantized 
with a uniform quantizer. Quantized coefficients are obtained by dividing the DCT 
coefficients by the step-size of the respective quantizer and then rounding the result 
to the nearest integer. These step-sizes are arranged into an 8 × 8 matrix called the 
quantization matrix. The step-size for the DC coefficient is set equal to 8 for both 
luma and chroma samples; thus, quantized DC values are in the range [0,255]. The 
AC coefficients can be represented with less than 8 bits using step-sizes larger than 8. 
The quantizer step-size for AC coefficients varies by frequency, according to the rela- 
tive visual importance of each DCT coefficient. MPEG-1 restricts quantized AC 
coefficients to be in the range [-255,255]. 
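
The uniform quantizer described above (divide by the step-size, round to the nearest integer, restrict AC levels to [-255, 255]) can be sketched as follows. This is an illustrative sketch only: the rounding rule for exact halves, the function names, and the toy 2 × 2 block with its step-sizes are assumptions, and real blocks are 8 × 8.

```python
def quantize(coef, step):
    """Uniform quantizer for integer inputs: divide by step, round to nearest.

    Exact halves are rounded away from zero (an assumption of this sketch).
    """
    sign = -1 if coef < 0 else 1
    return sign * ((abs(coef) + step // 2) // step)


def quantize_intra_block(dct_block, step_matrix):
    """Quantize an intra-coded block with a matrix of step-sizes.

    The DC term at [0][0] uses the step-size stored there (8 in the text);
    AC levels are clipped to [-255, 255].
    """
    out = []
    for i, row in enumerate(dct_block):
        out_row = []
        for j, coef in enumerate(row):
            level = quantize(coef, step_matrix[i][j])
            if (i, j) != (0, 0):
                level = max(-255, min(255, level))
            out_row.append(level)
        out.append(out_row)
    return out


# Toy 2 x 2 example with hypothetical step-sizes.
levels = quantize_intra_block([[1024, -100], [37, 3000]], [[8, 16], [16, 10]])
```

Here `levels` is `[[128, -6], [2, 255]]`: the last coefficient would quantize to 300 and is clipped to 255.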

MPEG-1 allows for spatially-adaptive quantization by introducing a quantizer scale 
parameter “MQUANT” in the syntax of each MB. As a result, there are two types of MBs 
in I-pictures: “intra” MBs are coded with the current quantization matrix. In “intra-A” 
MBs, the quantization matrix is scaled by MQUANT, which is transmitted in the header. 
MQUANT attains integer values between 1 and 31. Human visual system models sug- 
gest that MBs containing busy, textured areas can be quantized relatively coarsely. One 
of the primary differences between MPEG intra-mode and the original JPEG is the pro- 
vision of adaptive quantization in MPEG. It has been claimed that MPEG intra-mode 
provides 30% better compression compared with JPEG due to adaptive quantization. 
Furthermore, MQUANT can be used as an effective tool for rate control. 
Similar to JPEG, MPEG-1 enables prediction of the DC coefficient of each block 
from that of the previous block. The difference between successive DC values is VLC 
coded with 8 bits. The fixed-DC Huffman table has a logarithmic amplitude cat- 
egory structure borrowed from JPEG. Quantized AC coefficients are zigzag scanned 
and converted into [run, level] pairs as in JPEG. A single Huffman-like code table, 
which is different from that of JPEG, is used for all blocks, independent of the color 
component they belong to. There is no provision for downloading custom tables. 
Only those pairs that are highly probable are VLC coded. The rest are coded with an 
escape symbol followed by a fixed-length code to avoid long codewords. 
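
A minimal sketch of the steps that precede the VLC stage: DC prediction from the previous block, zigzag scanning, and conversion of the AC coefficients into [run, level] pairs. The Huffman-like tables and the escape mechanism are not modeled, and the initial DC predictor of 0 is an assumption of this example.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):                   # s = row + col (anti-diagonal)
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:                           # even diagonals run bottom-left to top-right
            rows = reversed(rows)
        order.extend((r, s - r) for r in rows)
    return order


def run_level_pairs(block):
    """Convert quantized AC coefficients into (run, level) pairs.

    `run` counts the zeros preceding each nonzero `level` along the scan.
    The DC term is skipped; trailing zeros produce no pair.
    """
    scan = [block[r][c] for r, c in zigzag_order(len(block))]
    pairs, run = [], 0
    for level in scan[1:]:
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs


def dc_differences(dc_values, predictor=0):
    """Code each DC value as its difference from the previous block's DC."""
    diffs = []
    for dc in dc_values:
        diffs.append(dc - predictor)
        predictor = dc
    return diffs


block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 10, 5, -3
pairs = run_level_pairs(block)    # [(0, 5), (1, -3)]
```
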


Inter-Frame Compression Modes 


In inter-frame compression modes, a temporal prediction is formed, and the result- 
ing prediction error is DCT encoded. There are two types of temporal-prediction 
modes allowed in MPEG-1: forward prediction (P-pictures) and bi-directional pre- 
diction (B-pictures). 

P-pictures allow MC forward-predictive coding with reference to a previous I- or 
P-picture. The forward temporal prediction $\hat{b}$ for an MB $b$ in the current frame $k$ 
is given by 

$$\hat{b} = \tilde{c}$$

where $\tilde{c}$ denotes the MB corresponding to $b$ in the reconstructed previous frame, as 
illustrated in Figure 8.4. Note that the reference picture is not necessarily the
immediately previous picture. 

The encoder selects the best compression mode for each MB from the list of 
allowable modes for a P-picture, which is shown in Table 8.2. “Intra” and “intra-A” 
MBs are also allowed in P-pictures, and may be selected for efficient compression 
of uncovered regions. MBs classified as “inter” are inter-frame coded, and the tem- 
poral prediction may or may not use motion compensation (MC) and/or adaptive 
quantization. The subscript “D” indicates that the prediction error will be coded, 
“F” indicates that forward MC is ON, and “A” indicates adaptive quantization (a 
new value of MQUANT is transmitted). That is, if an MB is labeled “inter-F” then 
the MCP $\hat{b}$ is satisfactory, so it suffices to transmit just the motion vector d for that 
MB; “inter-FD” indicates that we need to transmit a motion vector and the DCT 
coefficients of the prediction error; and “inter-FDA” indicates that in addition to a 
motion vector and the DCT coefficients, a new value of MQUANT is also being 
transmitted for that MB. A macroblock may be “skipped” if the block at the same 
position in the previous frame (without MC) is good enough, indicating a stationary 
“unchanged” area. 


Figure 8.4 MPEG forward prediction. 

Figure 8.5 MPEG bi-directional prediction. 

B-pictures allow MC interpolative coding, also known as bi-directional
prediction. The temporal prediction for the B-pictures is given by 

$$\hat{b} = \alpha_1 \tilde{c}_1 + \alpha_2 \tilde{c}_2, \qquad \alpha_1, \alpha_2 = 0, 0.5, 1 \ \text{ and } \ \alpha_1 + \alpha_2 = 1$$

where the tilde denotes the “reconstructed” value. Then $\alpha_1 = 1$ and $\alpha_2 = 0$ yields forward 
prediction, $\alpha_1 = 0$ and $\alpha_2 = 1$ gives backward prediction, and $\alpha_1 = \alpha_2 = 0.5$
corresponds to bi-directional prediction. This is illustrated in Figure 8.5. Note that in the 
bi-directional prediction mode, two displacement vectors $d_1$ and $d_2$ and the
corresponding prediction error $b - \hat{b}$ need to be encoded for each macroblock $b$. 
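
The weighted prediction above can be sketched in Python as follows. The weights are called `a1` and `a2` in the code; the (dy, dx) vector convention, full-pixel displacements, the function names, and the tiny 4 × 4 reference pictures are assumptions of this example, not part of the standard.

```python
def fetch_block(frame, top, left, size, d):
    """size x size block at (top, left) in `frame`, displaced by d = (dy, dx)."""
    dy, dx = d
    return [[frame[top + dy + i][left + dx + j] for j in range(size)]
            for i in range(size)]


def predict_block(prev_ref, next_ref, top, left, size, d1, d2, a1, a2):
    """Form a1 * c1 + a2 * c2 from the reconstructed reference pictures."""
    if (a1, a2) not in {(1, 0), (0, 1), (0.5, 0.5)}:
        raise ValueError("weights must be (1, 0), (0, 1), or (0.5, 0.5)")
    zero = [[0] * size for _ in range(size)]
    # Only fetch a reference block when its weight is nonzero.
    c1 = fetch_block(prev_ref, top, left, size, d1) if a1 else zero
    c2 = fetch_block(next_ref, top, left, size, d2) if a2 else zero
    return [[a1 * p + a2 * q for p, q in zip(r1, r2)]
            for r1, r2 in zip(c1, c2)]


def prediction_error(block, prediction):
    """Residual that would be DCT encoded."""
    return [[b - p for b, p in zip(rb, rp)] for rb, rp in zip(block, prediction)]


prev_ref = [[4 * r + c for c in range(4)] for r in range(4)]
next_ref = [[v + 10 for v in row] for row in prev_ref]
bi = predict_block(prev_ref, next_ref, 1, 1, 2, (-1, 0), (1, 1), 0.5, 0.5)
fwd = predict_block(prev_ref, next_ref, 1, 1, 2, (-1, 0), (1, 1), 1, 0)
```

With these inputs `fwd` is `[[1, 2], [5, 6]]` and `bi` is `[[10.5, 11.5], [14.5, 15.5]]`, the average of the forward and backward matches.
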
Bi-directional prediction or interpolative coding can be considered as a tem- 
poral multi-resolution technique, where we first encode only the I- and P-pictures 
(typically 1/3 of all frames). Then the remaining frames can be interpolated from 
the reconstructed I- and P-frames, and the resulting interpolation error is DCT 
encoded. The use of B-pictures provides several advantages: 


• They provide an effective means for handling problems associated with
covered/uncovered regions. For example, if an object is newly uncovered in the 
present frame, it can be non-causally predicted from the next frame. 

• MC averaging over two frames may provide better SNR compared to
prediction from just one frame. 

• Since B-pictures cannot be used in predicting any future pictures, they can be 
encoded with fewer bits without causing error propagation. 


The trade-offs associated with using B-pictures are: 


• Two frame stores are needed at the encoder and decoder, since at least two
reference (P- and/or I-) frames should be decoded first. 

• If too many B-pictures are used, then the prediction distance increases, resulting 
in lower temporal correlation and longer coding delays. 


The compression mode for each MB in a B-picture is selected from the list of 
allowable modes shown in Table 8.2. Again, “intra” and “intra-A” MBs are allowed. 
MBs classified as “inter” have the following options: “D” indicates the prediction 
error will be coded, “F” indicates forward prediction with motion compensation, 
“B” indicates backward prediction with motion compensation, “I” indicates inter- 
polated prediction with motion compensation, and “A” indicates adaptive quanti- 
zation. A macroblock may be “skipped” if the co-located block from the reference 
frame is good enough as is; i.e., no information needs to be sent. 
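
The MB-type labels of Table 8.2 can be read mechanically, letter by letter. The sketch below decodes a label into what is transmitted for that MB; the dictionary keys and the function name are illustrative choices, not MPEG-1 syntax elements.

```python
def describe_mb_type(name):
    """Decode a Table 8.2 label such as 'Inter-FDA' into what is transmitted."""
    if name == "Skip":
        return {"mode": "skip", "motion_compensation": None,
                "motion_vectors": 0, "coded_error": False, "new_mquant": False}
    mode, _, flags = name.partition("-")
    # At most one of F (forward), B (backward), I (interpolated) is present.
    direction = next((f for f in "FBI" if f in flags), None)
    names = {"F": "forward", "B": "backward", "I": "interpolated", None: None}
    vectors = {"F": 1, "B": 1, "I": 2, None: 0}
    return {
        "mode": mode.lower(),
        "motion_compensation": names[direction],
        "motion_vectors": vectors[direction],
        "coded_error": "D" in flags,     # prediction error is DCT coded
        "new_mquant": "A" in flags,      # adaptive quantization
    }


info = describe_mb_type("Inter-IDA")
```

For "Inter-IDA" this reports interpolated prediction with two motion vectors, a coded prediction error, and a new MQUANT value.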


Quantization and Coding 


In the inter-frame mode, the inputs to the DCT are in the range [-255,255]; thus, 
all DCT coefficients have the dynamic range [-2048,2047]. The quantization matrix 
is such that the effective quantization is relatively coarser compared to those matrices 
used for I-pictures. All quantized DCT coefficients, including the DC coefficient, 
are zigzag scanned to form [run, level] pairs, which are then coded using VLC. Dis- 
placement vectors are DPCM encoded with respect to the motion vectors of the 
previous blocks. VLC tables are specified for the type of MB, the differential motion 
vector, and the MCP error. Different Huffman tables are defined for encoding the 
macroblock types for P- and B-pictures, whereas the tables for motion vectors and 
the DCT coefficients are the same for both picture types. 
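
DPCM encoding of displacement vectors amounts to transmitting each vector as its difference from the previous one. A minimal sketch, assuming vectors are stored as integer (dy, dx) pairs in half-pixel units and that the predictor starts at (0, 0); the VLC of the differences is not modeled.

```python
def dpcm_encode_vectors(vectors, predictor=(0, 0)):
    """Replace each motion vector by its difference from the previous one."""
    diffs = []
    for dy, dx in vectors:
        diffs.append((dy - predictor[0], dx - predictor[1]))
        predictor = (dy, dx)
    return diffs


def dpcm_decode_vectors(diffs, predictor=(0, 0)):
    """Rebuild the motion vectors by accumulating the differences."""
    vectors = []
    for ddy, ddx in diffs:
        predictor = (predictor[0] + ddy, predictor[1] + ddx)
        vectors.append(predictor)
    return vectors


diffs = dpcm_encode_vectors([(2, 3), (2, 4), (1, 4)])   # [(2, 3), (0, 1), (-1, 0)]
```

Neighboring MBs tend to move together, so the differences are small and cheap to code.
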


MPEG-1 Encoding 


An MPEG-1 encoder includes modules for motion estimation, selection of com- 
pression mode (MTYPE), and setting the value of MQUANT at each MB, as well 
as MCP, quantizer and dequantizer, DCT and inverse DCT (IDCT), VLC, multi- 
plexer, and a buffer regulator. The encoder must duplicate the decoder-processing 
loop so that it produces the same MC predictions as the decoder, which are
obviously based on decoded previous frames. The IDCT module at the encoder should 
match the IDCT module at the decoder within a prespecified tolerance, given in 
IEEE Standard 1180-1990 with respect to a 64-bit floating-point IDCT, to avoid
propagation of errors in the prediction process. 

The number of I-, P- or B-pictures in a GOP (i.e., value of N and M) is 
application-dependent. It is specified that 1 out of every 132 pictures must be an 
I-picture to avoid error propagation due to IDCT mismatch between the encoder 
and decoder. Motion vectors are represented with one-half (0.5) pixel accuracy. The 
maximum length of the vectors that may be represented can be changed on a picture- 
by-picture basis to allow flexibility. Motion vectors that refer to pixels outside the 
picture are not allowed. Neither the motion-estimation algorithm nor the criterion to 
select MTYPE and MQUANT is part of the standard. The MPEG committee has 
developed a simulation model encoder, called SM3, in order to verify the standard. 
We elaborate on some of the non-normative choices made in SM3 in the following: 


1. Common choices for the number of B-frames in between two anchor frames 
are M = 1 or M = 2. Of course, the use of B-frames is optional in MPEG-1. 

2. Motion vectors are estimated for each MB. SM3 employs logarithmic search 
and telescopic search methods to first find the best-integer (full-pixel) motion 
vector. Telescopic search is a method for reducing the search space when motion 
estimation is carried over multiple frames with B-frames present. To avoid large 
search ranges, telescopic search cascades best-motion vector estimates obtained 
frame-to-frame. That is, first the motion vector from the current picture to the 
previous picture is searched centered about the zero vector, then the best estimate from 
the previous picture to the next previous picture is searched centered about the 
best estimate from the previous step, and so on. Finally, a half-pixel update is 
performed about the final full-pixel motion estimate. 
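
The cascading idea behind telescopic search can be sketched as follows. This is not SM3 code: an exhaustive small-window SAD search stands in for the logarithmic search, the half-pixel update is omitted, and the function names, the (dy, dx) convention, and the synthetic ramp frames are assumptions made for the illustration.

```python
def sad(cur, ref, top, left, size, d):
    """Sum of absolute differences between a block and its displaced match."""
    dy, dx = d
    return sum(abs(cur[top + i][left + j] - ref[top + dy + i][left + dx + j])
               for i in range(size) for j in range(size))


def local_search(cur, ref, top, left, size, center, radius):
    """Exhaustive full-pixel search in a small window around `center`."""
    h, w = len(ref), len(ref[0])
    best, best_cost = None, None
    for dy in range(center[0] - radius, center[0] + radius + 1):
        for dx in range(center[1] - radius, center[1] + radius + 1):
            # Vectors that refer to pixels outside the picture are not allowed.
            if not (0 <= top + dy and top + dy + size <= h and
                    0 <= left + dx and left + dx + size <= w):
                continue
            cost = sad(cur, ref, top, left, size, (dy, dx))
            if best_cost is None or cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best


def telescopic_search(cur, refs, top, left, size, radius):
    """`refs` lists reference pictures from nearest to farthest in time.

    Each search is centered on the estimate found for the nearer picture.
    """
    estimate, track = (0, 0), []
    for ref in refs:
        estimate = local_search(cur, ref, top, left, size, estimate, radius)
        track.append(estimate)
    return track


def make_ramp_frame(t, v=(1, 2), n=16):
    """Synthetic frame: the ramp 100*y + x translated by v = (vy, vx) per frame."""
    return [[100 * (y - t * v[0]) + (x - t * v[1]) for x in range(n)]
            for y in range(n)]


cur = make_ramp_frame(3)
refs = [make_ramp_frame(2), make_ramp_frame(1), make_ramp_frame(0)]
track = telescopic_search(cur, refs, 8, 8, 4, radius=2)
```

With a window of only ±2 pixels per step, `track` reaches the displacement (-3, -6) for the farthest picture, which a single direct search would need a ±6 window to find.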

3. SM3 determines the compression mode MTYPE for each MB depending on 
the picture type of the current picture from Table 8.2. There are two approaches: 
i) An exhaustive method tries coding each MB using each allowed type, then 
chooses the one that yields the least number of bits. ii) A faster method makes a 
series of decisions sequentially. In SM3, these decisions are ordered, in the case 
of P-pictures, as: 


a. Decide whether a motion vector should be transmitted (MC) or not (No 
MC). 

b. If no MC is selected, decide whether intra-mode or inter-mode with no 
motion vector will be used. 

c. If inter-mode is selected, decide if the residual error is large enough to be 
coded (Coded) or not (Not Coded). 

d. Decide if the quantizer scale is satisfactory or needs to be changed. 
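
The sequential decisions (a) to (d) for P-pictures can be sketched as a small decision function that maps four yes/no outcomes to an MB type from Table 8.2. How each outcome is actually measured (thresholds, activity measures) is not shown; the boolean inputs and the function name are hypothetical.

```python
def choose_p_mtype(use_mc, prefer_intra, code_residual, change_quantizer):
    """Map the four sequential decisions to a P-picture MB type."""
    # a. Transmit a motion vector (MC) or not (No MC)?
    if use_mc:
        prefix = "Inter-F"
    else:
        # b. Without MC: intra-mode, or inter-mode with no motion vector?
        if prefer_intra:
            # d. Is a new quantizer scale needed?
            return "Intra-A" if change_quantizer else "Intra"
        prefix = "Inter-"
    # c. Is the residual error large enough to be coded?
    if not code_residual:
        return "Inter-F" if use_mc else "Skip"
    # d. Is a new quantizer scale needed?
    return prefix + "D" + ("A" if change_quantizer else "")


mtype = choose_p_mtype(use_mc=True, prefer_intra=False,
                       code_residual=True, change_quantizer=True)   # "Inter-FDA"
```

Every string the function can return is one of the P-picture entries of Table 8.2.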

4. Rate control is achieved by maintaining a parametric model of the decoder, 
known as a video buffer verifier (VBV). The encoder monitors the status of the 
model buffer to update the allocation of bits to each picture type. The larger the 
buffer, the greater the flexibility of the encoder at the expense of larger decoding 
delay. The value of MQUANT is updated based on normalized spatial activity 
of the MB. 


8.2.3 MPEG-2 Standard 


The quality of MPEG-1 compressed video at 1.2 Mbps is unacceptable for most 
entertainment applications. Subjective tests have indicated that ITU-R 601 video 
(four times MPEG-1 SIF) can be compressed with excellent quality at 4-6 Mbps. 
MPEG-2 is a compatible extension of MPEG-1 for a wide range of applications 
at various bit-rates (4-50 Mbps) and resolutions. The main parts of the MPEG-2 
standard are Systems (ISO/IEC 13818-1), Video (ISO/IEC 13818-2), Audio 
(ISO/IEC 13818-3), AAC Audio (ISO/IEC 13818-7), and DSM-CC (ISO/IEC 
13818-6). Here, we focus on MPEG-2 video only. MPEG-2 video began as a com- 
mittee draft in November 1993, and was formally approved as an international 
standard in 1994. It has achieved mass-market acceptance in digital TV broadcast 
and DVD. 

The main new features of MPEG-2 video are: i) It allows efficient coding of
interlaced video by introducing field pictures, frame/field adaptive MC, dual-prime
MC, 16 X 8 MC, frame/field adaptive DCT, and alternative DCT coefficient scanning
options. ii) It enables higher-quality video compression (at higher bit-rates) by
allowing higher-definition inputs, alternative sub-sampling of the chroma channels,
and improved quantization and VLC coding options. iii) It offers a scalable
bit-stream syntax option to allow for scalability and data partitioning. iv) Subsets of the
full syntax have been specified under a number of “profiles.” Furthermore, a number
of “levels” have been introduced within these profiles to impose constraints on some
of the video parameters. In the following, we discuss the input-video formats and
data structure of MPEG-2, how MPEG-2 handles interlaced video, and extensions for
encoding of higher-definition video, and give a brief overview of the profiles and levels.


More details can be found in [Gal 92] and [Chi 95]. 


Input-Video Formats and Data Structure 


MPEG-2 video can take both interlaced (e.g., ITU-R 601 525/30 and 625/25) and
progressive (e.g., SIF 525/30 and 625/25) inputs. MPEG-2 also allows for interlaced 
display of progressive coded video (e.g., progressive film source can be coded at 24 
frames/s and displayed in interlaced format using 3:2 pull-down). Since it allows for 
4:2:0, 4:2:2 (chroma sub-sampled only horizontally), and 4:4:4 (no sub-sampling) 
chroma formats, an MB in MPEG-2 may contain 6 (4 luma, 1 Cr, and 1 Cb), 8 
(4 luma, 2 Cr, and 2 Cb), or 12 (4 luma, 4 Cr, and 4 Cb) 8 X 8 blocks. The spatial 
locations of luma and chroma pixels for the 4:2:0 and 4:2:2 formats are depicted in 
Figure 8.6. 
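The block counts quoted above follow directly from the chroma sub-sampling factors. The short sketch below derives them; the function and dictionary names are illustrative only and are not part of the MPEG-2 syntax.

```python
# Number of 8x8 blocks in one 16x16 macroblock for each MPEG-2 chroma format.
# Names are hypothetical; only the sub-sampling factors come from the text.
CHROMA_SUBSAMPLING = {
    "4:2:0": (2, 2),  # chroma halved horizontally and vertically
    "4:2:2": (2, 1),  # chroma halved horizontally only
    "4:4:4": (1, 1),  # no chroma sub-sampling
}


def blocks_per_macroblock(chroma_format):
    """Return the total number of 8x8 blocks (luma + Cb + Cr) in one MB."""
    h_factor, v_factor = CHROMA_SUBSAMPLING[chroma_format]
    luma_blocks = (16 // 8) * (16 // 8)            # four 8x8 luma blocks
    chroma_w, chroma_h = 16 // h_factor, 16 // v_factor
    blocks_per_chroma = (chroma_w // 8) * (chroma_h // 8)
    return luma_blocks + 2 * blocks_per_chroma     # one Cb and one Cr plane


for fmt in CHROMA_SUBSAMPLING:
    print(fmt, blocks_per_macroblock(fmt))
```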

Interlaced video is composed of a sequence of top and bottom (or even and odd) 
fields separated by a field period. Either the top or bottom field can be designated 
as the temporally first field. If the input is interlaced, the output bit-stream must 
consist of a sequence of fields that are separated by the field period. MPEG-2 defines 
two new picture types to effectively deal with interlaced video: 


1. Frame pictures, which are obtained by interleaving lines of even and odd fields 
to form composite frames. Frame pictures can be I-, P-, or B-type. An MB of 
the luminance frame picture is shown in Figure 8.9(a). 

2. Field pictures are simply the even and odd fields treated as separate pictures.
Each field picture can be I-, P-, or B-type.

Figure 8.6 Location of luminance and chrominance pixels for (a) 4:2:0 and (b) 4:2:2 frame pictures.

Figure 8.7 Summary of picture types in MPEG-2.

Figure 8.8 GOP for an interlaced video.


A summary of picture types for MPEG-2 is shown in Figure 8.7. A group of 
pictures can be composed of an arbitrary mixture of field and frame pictures. Field 
pictures always appear in pairs (called the top field and bottom field), which together 
constitute a frame. If the top field is a P- (B-)picture, the bottom field must also be 
a P- (B-)picture. If the top field is an I-picture, the bottom field can be an I- or a 
P-picture. A pair of field pictures is encoded in the order that it should appear at the 
output. An example of a GOP for an interlaced video is shown in Figure 8.8. On the 
other hand, in progressive video all pictures are frame pictures. 
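The pairing rules for field pictures stated above can be written as a small check. This is only a sketch of the rule as worded in this section; the function name is hypothetical.

```python
def valid_field_pair(top_type, bottom_type):
    """Check the coding types of the two field pictures that form one frame.

    Rule from the text: a P (B) top field requires a P (B) bottom field,
    while an I top field may be followed by an I or a P bottom field.
    """
    if top_type in ("P", "B"):
        return bottom_type == top_type
    if top_type == "I":
        return bottom_type in ("I", "P")
    return False


print(valid_field_pair("I", "P"))  # allowed
print(valid_field_pair("P", "B"))  # not allowed
```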


Interlaced-Video Compression 


There are two options in coding interlaced video: i) every field can be encoded inde- 
pendently (field pictures), or ii) two fields may be encoded together as a composite 
frame (frame pictures). It is possible to switch between frame and field picture modes 
on a frame-to-frame basis. The main drawback of processing interlaced video in the 
form of frame pictures is that since alternate scan lines come from different fields, 
motion during the field period causes misalignment of spatial structure resulting in 
artificial vertical high-frequency content (or equivalently loss of vertical correlation). 
Thus, frame encoding may be preferred for relatively still images, and field encoding 
may give better results when there is significant motion. In order to deal with inter- 
laced video effectively, MPEG-2 offers new options for the computation and coding 
of DCT coefficients and motion-compensation such as:

• New MC prediction (MCP) modes for interlaced video
• Field/frame DCT option per MB for frame pictures
• Alternate scan for ordering DCT coefficients


MC Prediction Modes for Interlaced Video 


In order to deal with interlaced video effectively, five types of MCP are provided in 
MPEG-2: frame prediction for frame pictures, field prediction for field pictures, field 
prediction for frame pictures, 16 X 8 prediction for field pictures, and dual-prime pre- 
diction for P-pictures. The best prediction mode is usually frame prediction for MBs in 
stationary (no motion) regions and field prediction for MBs in moving regions, since 
in the presence of motion, frame prediction suffers from strong motion artifacts, while 
in the absence of motion, field prediction does not utilize all available information. 

Frame prediction for frame pictures covers all MC modes that were discussed in 
MPEG-1 including P- and B-modes. In field prediction, each field is predicted inde- 
pendently using data from one or more previously decoded fields. Within a field pic- 
ture only field prediction can be used. However, in a frame picture either field or frame 
prediction may be employed on an MB-by-MB basis. Field prediction for field pictures 
is similar to frame prediction except that both target MB and reference MB consist of 
pixels from one field. The parity of the fields for target and reference MBs may or may 
not be the same. To perform field prediction for frame pictures, the target MB is first 
split into the top and bottom fields. Field prediction is then performed independently 
for each of the two 16 X 8 parts. Thus, in this mode, two motion vectors are needed 
for P-pictures and two or four motion vectors are needed for B-pictures. 

There are also two other prediction modes: 16 X 8 MC mode for field pictures 
and dual-prime mode for P-pictures. In 16 X 8 MC for field picture mode, each MB 
is split into an upper half and lower half, each of which is 16 X 8. They are motion 
compensated independently. Thus, two motion vectors are needed per MB, one for 
the upper and the other for the lower part, in P-pictures. In the case of bi-directional 
prediction, four motion vectors are needed. In dual-prime mode, one motion vector 
and a small differential vector are encoded for each MB. In the case of field pictures, 
two motion vectors are derived from this information and used to form predictions 
from two reference fields, which are averaged to form the final prediction. Dual- 
prime mode is used only for P-pictures [ISO 13]. 


Field or Frame DCT Option for Each MB in a Frame Picture 


Prior to the computation of the DCT, the encoder may reorder the luminance lines
in an MB such that the first eight lines come from the top field and the last eight
come from the bottom field. This allows computing DCT on a field-by-field basis
for specific parts of a frame picture. For example, field DCT may be chosen for
macroblocks containing high motion, whereas frame DCT may be appropriate for
macroblocks with little or no motion but containing high spatial activity. The internal
organization of an MB for frame (a) and field (b) DCT is shown in Figure 8.9.
Note that in 4:2:0 sampling only frame DCT can be used for the chroma blocks.

Figure 8.9 DCT for interlaced frame pictures: (a) frame DCT and (b) field DCT.
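The line reordering for field DCT can be illustrated with a few lines of NumPy. The sketch assumes that even-numbered lines (0, 2, ...) of the luminance MB belong to the top field; the helper names are illustrative.

```python
import numpy as np


def to_field_order(mb):
    """Reorder a 16x16 luma MB: rows 0-7 = top field, rows 8-15 = bottom field.

    Assumes even-numbered lines of the frame MB belong to the top field.
    """
    return np.vstack([mb[0::2, :], mb[1::2, :]])


def split_into_blocks(mb):
    """Split a 16x16 array into its four 8x8 blocks in raster order."""
    return [mb[r:r + 8, c:c + 8] for r in (0, 8) for c in (0, 8)]


# Each row of this test MB holds its own line number.
frame_mb = np.repeat(np.arange(16).reshape(16, 1), 16, axis=1)
field_mb = to_field_order(frame_mb)
print(field_mb[:, 0])  # line numbers after reordering
```

With frame DCT the four blocks come from `split_into_blocks(frame_mb)`; with field DCT they come from `split_into_blocks(to_field_order(frame_mb))`, so each 8 X 8 block then contains lines of a single field only.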


Alternate Scan and Field/Frame DCT for Frame Pictures 


One way to deal with the reduced vertical resolution in frame pictures originating 
from interlaced video is to use a scan that favors vertical frequencies over horizontal 
frequencies. Therefore, MPEG-2 allows for an optional scanning pattern, called 
the “alternate scan,” which is depicted in Figure 8.10, in addition to the zigzag 
scanning. 


Other Tools and Improvements 


MPEG-2 features some extensions in the quantization and coding options for 
improved image quality in exchange for higher bit-rates. In particular, it allows for 
i) finer quantization of the DCT coefficients, ii) finer adjustment of the quantizer 
scale factor, and iii) a separate VLC table for the DCT coefficients for the intra- 
macroblocks, some of which are detailed in the following. 


Finer Quantization of the DCT Coefficients 


In intra-macroblocks, the quantization weight for the DC coefficient can be 8, 4,
2, or 1, i.e., 11 bits (full) resolution is allowed for the DC coefficient. Recall that
this weight is fixed to 8 in MPEG-1. AC coefficients are quantized in the range
[-2048, 2047] in MPEG-2, as opposed to [-256, 255] in MPEG-1. In non-intra-macroblocks,
all coefficients are quantized into the range [-2048, 2047]. This range
was [-256, 255] in MPEG-1.

Figure 8.10 Alternate scan.

Table 8.3 Optional Set of MQUANT Values

0.5    3.5    9.0   18.0   32.0   56.0
1.0    4.0   10.0   20.0   36.0
1.5    5.0   11.0   22.0   40.0
2.0    6.0   12.0   24.0   44.0
2.5    7.0   14.0   26.0   48.0
3.0    8.0   16.0   28.0   52.0


Finer Adjustment of MQUANT 


In addition to a set of MQUANT values that are integers between 1 and 31 (also
known as linear quantization), MPEG-2 allows for an optional set of 31 values ranging
from 0.5 to 56 (also referred to as nonlinear quantization), which are listed in
Table 8.3. This optional set provides higher accuracy for small coefficient values.
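The values in Table 8.3 follow a piecewise-uniform pattern: the spacing between consecutive values doubles from one segment to the next, which is why small values are represented more finely. The sketch below regenerates the table; the segment boundaries are read off the table itself rather than quoted from the standard, and the function name is illustrative.

```python
def nonlinear_mquant_values():
    """Regenerate the 31 optional MQUANT values listed in Table 8.3."""
    values = []
    value = 0.0
    # (step size, number of values) for each segment, read off Table 8.3
    for step, count in [(0.5, 8), (1.0, 8), (2.0, 8), (4.0, 7)]:
        for _ in range(count):
            value += step
            values.append(value)
    return values


table = nonlinear_mquant_values()
print(len(table), table[0], table[-1])
```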


Data Partitioning 


MPEG-2 supports partitioning of a single-layer coded bit-stream into two or 
more layers, such that some layers are assigned a higher priority (hence, quality- 
of-service parameters) for reliable transmission over heterogeneous networks. 
Data partitioning may be considered as an elementary form of scalable video
representation.

Table 8.4 Parameter Constraints According to Levels

Level        Max. Pixels/Line   Max. Lines/Frame   Max. Frames/s
LOW           352                288               30
MAIN          720                576               30
HIGH-1440    1440               1152               60
HIGH         1920               1152               60

Profiles and Levels 


MPEG-2 full syntax covers a wide range of features and free parameters. Consider- 
ing practical difficulties with the hardware implementation of the full syntax, there 
are six MPEG-2 profiles that define subsets of the syntax and four levels that impose 
constraints on the values of the free parameters. The profiles are the Simple, Main, 
SNR, Spatial, High, and 4:2:2 profiles. The parameter constraints imposed by the 
four levels are summarized in Table 8.4. 

Simple, Main, SNR, Spatial, and High profiles are designed in an “onion-ring” 
structure, i.e., the High profile supports all tools supported by the previous four 
profiles and some new ones, the Spatial profile supports all tools supported by the 
previous three and new ones, and so on. The Simple profile does not allow B-pictures 
and supports only the Main level. The Main profile does not include any scalability 
tools (see Section 8.5) and supports all four levels with upper bounds on the bit-rates 
equal to 4, 15, 60, and 80 Mbps for the Low, Main, High-1440, and High levels, 
respectively. The High profile is a superset of the Spatial profile such that it also supports
4:2:2 video. The 4:2:2 profile addresses professional digital-video applications,
which require 4:2:2 chroma sampling but not scalability.

The Main profile at the Main level (MP@ML), which is used for standard- 
resolution digital TV, has been by far the most widely adopted MPEG-2 profile. 
The 4:2:2 profile at the Main level is required to decode all bit-streams decodable by 
MP@ML decoders. 
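As an illustration of how levels constrain the free parameters, the sketch below looks up the lowest Main-profile level that admits a given video format, using only the limits in Table 8.4 and the Main-profile bit-rate bounds quoted above. The standard imposes further constraints (for example on buffer size) that are not modeled here, and all names are illustrative.

```python
# (max pixels per line, max lines per frame, max frames/s) from Table 8.4
LEVELS = {
    "LOW": (352, 288, 30),
    "MAIN": (720, 576, 30),
    "HIGH-1440": (1440, 1152, 60),
    "HIGH": (1920, 1152, 60),
}
# Upper bounds on bit-rate (Mbps) for the Main profile, as quoted in the text
MAIN_PROFILE_MAX_MBPS = {"LOW": 4, "MAIN": 15, "HIGH-1440": 60, "HIGH": 80}


def lowest_main_profile_level(width, height, fps, mbps):
    """Return the lowest level whose limits admit the video, or None."""
    for name in ("LOW", "MAIN", "HIGH-1440", "HIGH"):
        max_w, max_h, max_fps = LEVELS[name]
        if (width <= max_w and height <= max_h and fps <= max_fps
                and mbps <= MAIN_PROFILE_MAX_MBPS[name]):
            return name
    return None


print(lowest_main_profile_level(720, 576, 25, 6))  # standard-resolution TV
```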


MPEG-2 Encoding 


MPEG-2 encoding is more complex than MPEG-1 encoding since the encoder
needs to consider several more encoding mode choices to obtain the best results. We
discuss some non-normative elements of the MPEG-2 Test Model 5 (TM5) encoder
developed by the MPEG group in order to verify the standard. TM5 estimates two
types of motion vectors (MVs), frame MVs and field MVs, for each MB of P- and
B-frame pictures. For B-frames, two frame-motion vectors (one forward and one
backward) are estimated for each 16 X 16 MB. In addition, four field motion vectors
(two for each direction) are estimated, where each MV corresponds to a 16 X 8
luminance region. For P-frames, only forward frame- and field-motion vectors (total
of 3) are estimated. Field MVs are estimated to minimize the sum of absolute errors
in the respective fields, whereas the frame vectors minimize the sum of the errors in
the two fields.

To decide on the best compression mode, the following procedures are
performed at each MB: For P-pictures, TM5 first decides between a frame MV vs.
two field MVs based on comparison of the respective sum of absolute errors. For
B-pictures, there are a total of six possible combinations, which are combinations of
forward/backward/interpolated and frame/field MVs, at each MB. TM5 computes
the sum of absolute errors for each combination and chooses the one that yields
the minimum value. Then, a decision must be made between transmission of the
selected motion vector(s) (MC) vs. not transmitting any MV (no MC). This is also
based on comparison of the respective sum of absolute errors. Next, a decision is
made between intra- vs. inter-coding. This decision criterion, based on comparison
of variances of the block difference and MC block difference, is the same as in
MPEG-1 SM3. After the intra/inter decision, TM5 decides between frame and field DCT.
To this effect, each 16 X 16 luminance MB is rearranged as four 8 X 8 field blocks,
where each block contains pixels only from a single field. TM5 then computes the
vertical correlation of the original (frame) and rearranged (field) block configurations
and chooses the one that gives higher correlation as the DCT type.
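The frame vs. field DCT decision can be sketched as follows. The exact statistic used by TM5 is not reproduced here; as a stand-in, the sketch measures vertical correlation by the correlation coefficient between each line and the line below it, and it assumes even-numbered lines belong to the top field. Function names are illustrative.

```python
import numpy as np


def vertical_correlation(block):
    """Correlation coefficient between each line and the line below it."""
    upper = block[:-1, :].astype(float).ravel()
    lower = block[1:, :].astype(float).ravel()
    upper -= upper.mean()
    lower -= lower.mean()
    denom = np.sqrt((upper ** 2).sum() * (lower ** 2).sum())
    # A flat block has no variation; treat it as perfectly correlated.
    return float((upper * lower).sum() / denom) if denom > 0 else 1.0


def choose_dct_type(mb):
    """Pick 'frame' or 'field' DCT for a 16x16 luma MB (simplified rule)."""
    frame_corr = vertical_correlation(mb)
    top, bottom = mb[0::2, :], mb[1::2, :]   # assumed field parity
    field_corr = 0.5 * (vertical_correlation(top) + vertical_correlation(bottom))
    return "field" if field_corr > frame_corr else "frame"


# A vertical edge that moved eight pixels between the two fields
moving = np.zeros((16, 16))
moving[0::2, 4:] = 100
moving[1::2, 12:] = 100
print(choose_dct_type(moving))
```

For the moving edge, adjacent frame lines are poorly correlated while lines within each field are identical, so the rule selects field DCT, in line with the discussion above.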

An MPEG-video decoder de-multiplexes the incoming video bit-stream (with 
a standard syntax) into image data and side information such as MTYPE, motion 
vectors, MQUANT, and so on. The inverses of the encoder operations are performed
on the image data. The decoder must employ at least two frame-stores, since two 
reference frames are needed to decode B-pictures. 


8.3 MPEG-4 AVC/ITU-T H.264 Standard 


ITU-T H.264/MPEG-4 Part 10 Advanced Video Coding (AVC) is a state-of-the-art 
video coding standard developed by the Joint Video Team (JVT) consisting of experts 
from ITU-T VCEG and ISO/IEC MPEG [Wie 03, Sul 05]. It improves the coding 
efficiency (on average by a factor of two) over MPEG-2 at the cost of some increase 
in complexity. Fidelity range extensions (FRExt) were added as an amendment to the 
standard in July 2004 to support the requirements of professional high-fidelity video 
applications. FRExt has provided further coding efficiency improvements (up to a
factor of 3) over MPEG-2 for high-fidelity video. H.264/AVC introduces several
innovative concepts and tools, which are discussed in the following. 

The original version of the standard specifies three profiles: Baseline (low com- 
plexity and robustness), Main (high compression efficiency), and Extended (high 
compression efficiency and robustness). A family of High profiles, which consists of
the High profile (HP), High 10 profile (Hi10P), High 4:2:2 profile (Hi422P), and
High 4:4:4 profile (Hi444P), was added with the FRExt amendment; these profiles support
adaptive transform block size and perceptual quantization scaling matrices in addition
to all features of the previous Main profile. The Hi444P also supports an integer
residual color transform for coding RGB video [Sul 05].


8.3.1 Input-Video Formats and Data Structure 


The Baseline profile accepts progressive video with 4:2:0 color sampling and 8 bits
per sample per component. The Main, Extended, and High profiles accept
progressive or interlaced video with 4:2:0 color and 8 bits per sample per component.
Hi10P allows for 4:2:0 color with 10 bits per sample per component. Hi422P
allows for 4:2:2 color with 10 bits per sample per component, while Hi444P allows for
4:4:4 color with 12 bits per sample per component.

The data structure of H.264/AVC is organized as a sequence of pictures, which 
consist of slices, which in turn consist of macroblocks, that can be expressed as 


Sequence(pictures(slices(macroblocks))) 


Each sequence starts with an instantaneous decoding refresh (IDR) access unit; i.e., a 
picture that can be decoded without decoding any previous pictures. An IDR picture 
indicates that no subsequent picture in the stream will require reference to pictures 
prior to the IDR picture. 


Macroblocks and Slices 


A macroblock (MB) is defined as a 16 X 16 luma block and the corresponding chroma 
blocks as usual. MBs are grouped into slices. A picture may contain one or more slices 
that are independently decodable. Slices i) help generation of payload packets that can 
fit the maximum-transfer unit (MTU) of a network so that each MTU carries an inte- 
ger number of slices, ii) help error resilience in lossy transmission environments so that 
a lost packet affects only a limited region of a picture, and iii) enable parallel encoding 
and decoding since each slice can be independently encoded/decoded. Picture types 
are not used in H.264/AVC. Instead, there are five slice types: I-slice (all MBs are 
coded by using intra-prediction only), P-slice (up to one motion vector per sub-block),
B-slice (up to two motion vectors per sub-block), switching P (SP) slice, and switching
I (SI) slice. That is, a picture may contain slices of different types. SP and SI slices are 
included in the Extended profile to facilitate switching between different bit-streams 
representing the same video encoded at multiple quality levels (multiple bit-rates). 


Temporal-Prediction Structures and Processing Order 


The classical temporal-prediction structures are “IPPPPP...” picture ordering with 
sequential encoding/decoding or “IBBBPBBBP...” picture display ordering with 
the corresponding encoding/decoding order shown in the Example in Section 8.2.2. 
H.264/AVC allows for more flexible slice dependency and temporal-prediction struc- 
tures. For example, a picture can be marked as a reference picture regardless of the 
coding types of its slices, stored in the decoded picture buffer (DPB) that can hold 
up to 16 pictures, and used for MCP of future pictures before the next IDR picture. 

Experiments have indicated that hierarchical prediction structures that enable 
multi-level temporal scalability of the bit-stream also increase the coding efficiency. 
An example of a hierarchical prediction structure with four dyadic levels is depicted 
in Figure 8.11, where pictures at the highest level are called key pictures. Key pictures 
are either intra-coded (to enable random access) or inter-coded using previous key 
pictures as reference for MCP. In the example, the first picture with the picture order 
count (POC) 0 and the picture with POC 8 are key pictures. Only the first picture is 
intra-coded as an IDR picture. The remaining pictures are hierarchically predicted as 
illustrated, such that the second-level picture with POC 4 is predicted from the two 
key pictures, and the third-level pictures with POC 2 and 6 are predicted as shown 
by the arrows. The fourth-level pictures are marked “b,” which implies they are not 
a reference for other pictures. 
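The dyadic hierarchy described above can be expressed compactly in terms of the POC. The sketch below assigns a temporal level to each picture in a GOP of eight pictures (level 1 denotes key pictures, following the numbering used in the text) and lists one coding order in which every reference precedes the pictures that depend on it. The function names and the particular ordering within a level are assumptions made for illustration.

```python
def temporal_level(poc, gop_size=8):
    """Temporal level in a dyadic hierarchy; level 1 = key picture.

    gop_size is assumed to be a power of two.
    """
    offset = poc % gop_size
    level, step = 1, gop_size
    while offset % step != 0:
        step //= 2
        level += 1
    return level


def coding_order(gop_size=8):
    """One valid coding order for POC 0..gop_size: lower levels first."""
    return sorted(range(gop_size + 1),
                  key=lambda poc: (temporal_level(poc, gop_size), poc))


print([temporal_level(p) for p in range(9)])
print(coding_order())
```

Dropping all pictures above a chosen level leaves a decodable sequence at a reduced frame rate, which is the multi-level temporal scalability mentioned above.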


8.3.2 Intra-Prediction 


In MPEG 1/2, similar to JPEG, intra-prediction has been limited to prediction of
the DC coefficients across blocks in the frequency domain. H.264/AVC employs
a novel approach to block-based intra-prediction in the spatial domain. All slice
types in the Baseline, Main, and Extended profiles support two types of luma
intra-prediction: 4 X 4 and 16 X 16 intra-prediction. In addition, High profiles (FRExt)
also support 8 X 8 intra-prediction. The chroma samples are predicted by using the
modes in the 16 X 16 intra-prediction.

Figure 8.11 Hierarchical B-pictures. POC denotes the display order.

Figure 8.12 Illustration of 4 X 4 intra-prediction: (a) 4 X 4 block and boundary pixels;
(b) prediction directions.

In 4 X 4 prediction, 16 samples of a 4 X 4 block denoted by small letters a
through p are predicted from border samples of previously encoded MBs marked by
capital letters A through Q in Figure 8.12(a). The encoder can select either the DC
mode (mode 2), where all samples a through p are predicted by an average value, or
one of modes 0-1 and 3-8, which correspond to prediction in the directions indicated by
the arrows in Figure 8.12(b); e.g., in mode 0, a = e = i = m = A, b = f = j = n = B,
c = g = k = o = C, and d = h = l = p = D; in mode 1, a = b = c = d = I,
e = f = g = h = J, i = j = k = l = K, and m = n = o = p = L.

In 16 X 16 prediction, only four modes are supported: DC, horizontal, vertical,
and planar. The first three modes are the same as in 4 X 4 prediction, except they
extend to the entire 16 X 16 block. Planar mode predicts the current block by a
plane-fit approximation to model the horizontal and vertical variation in the intensities
of the border pixels of neighbor blocks.
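The three simplest 4 X 4 modes can be sketched in a few lines. The sketch assumes that all neighboring samples are available; the rounded average used for the DC mode is the form commonly given for H.264/AVC when both the upper and left neighbors exist, and the function name is illustrative. The directional modes 3-8 are omitted.

```python
import numpy as np


def intra4x4_predict(mode, above, left):
    """Predict a 4x4 block from border samples (modes 0, 1, and 2 only).

    above: samples A, B, C, D (row above the block)
    left:  samples I, J, K, L (column left of the block)
    """
    above = np.asarray(above[:4], dtype=np.int32)
    left = np.asarray(left[:4], dtype=np.int32)
    if mode == 0:   # vertical: each column copies the sample above it
        return np.tile(above, (4, 1))
    if mode == 1:   # horizontal: each row copies the sample to its left
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == 2:   # DC: rounded average of the eight border samples
        dc = (int(above.sum()) + int(left.sum()) + 4) >> 3
        return np.full((4, 4), dc, dtype=np.int32)
    raise ValueError("only modes 0, 1, and 2 are sketched here")


print(intra4x4_predict(0, [1, 2, 3, 4], [5, 6, 7, 8]))
```

An encoder would form the prediction for every candidate mode and keep the one whose residual is cheapest to code.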


8.3.3 Motion Compensation 


Over the years, significant gains in compression ratios have been achieved by advances 
made in the motion-compensated prediction (MCP) methods. In MPEG 1/2, we 
have P- and B-MCP modes with 16 X 16 fixed block size and half-pixel accuracy for 
motion vectors (MV). H.264/AVC brings several innovations for improved MCP.
They include: quarter-pixel accuracy (1/8 pixel accuracy for chroma) MCP, reference
picture extrapolation to handle MVs extending outside picture boundaries, variable 
size blocks for MCP, multi-picture MCP, and multi-hypothesis and weighted MCP, 
which are explained below. 


Motion-Vector Precision and Encoding 


The precision of MVs is 1/4 of the distance between luma pixels. The correspond- 
ing sub-pixel sample values are computed by interpolation. Half-pixel sample values 
are evaluated by using a separable 6-tap FIR filter horizontally and vertically, while 
quarter-pixel sample values are computed by bi-linear interpolation between full- 
and computed half-pixel samples. Chroma samples are always computed by bi-linear 
interpolation. If the MVs point to outside of reference picture boundaries, the refer- 
ence picture is extrapolated by replicating border pixels. The components of the MV 
are encoded predictively using either median or directional prediction from the MVs 
of neighboring blocks. No prediction takes place across slice boundaries. 


Variable-Block-Size MCP 


Each luma MB in P- and B-slices can be partitioned into blocks of different sizes and 
shapes for MCP. H.264/AVC syntax supports 16 X 16, 16 X 8, 8 X 16, and 8 X 8 
blocks as shown in Figure 8.13. In addition, each 8 X 8 block can be further sub- 
divided into 8 X 4, 4 X 8, or 4 X 4 sub-blocks. Thus, a maximum of 16 MVs may 
need to be transmitted for each MB in P-mode MCP. 
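The count of motion vectors per MB in P-mode follows from the partition tree. The sketch below tallies one MV per partition; the dictionary and function names are illustrative.

```python
# Number of partitions (one MV each in P-mode) for each MB-level mode
MB_PARTITIONS = {"16x16": 1, "16x8": 2, "8x16": 2, "8x8": 4}
# Number of sub-blocks when an 8x8 block is further subdivided
SUB_PARTITIONS = {"8x8": 1, "8x4": 2, "4x8": 2, "4x4": 4}


def motion_vectors_per_mb(mb_mode, sub_modes=None):
    """Count the MVs needed for one MB in P-mode MCP."""
    if mb_mode != "8x8":
        return MB_PARTITIONS[mb_mode]
    sub_modes = sub_modes or ["8x8"] * 4
    if len(sub_modes) != 4:
        raise ValueError("an 8x8-mode MB has exactly four 8x8 blocks")
    return sum(SUB_PARTITIONS[m] for m in sub_modes)


print(motion_vectors_per_mb("8x8", ["4x4"] * 4))  # worst case
```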


Multi-Picture MCP 


Multi-picture MCP refers to using more than one previously coded picture as uni-directional
references even in P-slices, as depicted in Figure 8.14. It requires both the
encoder and decoder to store the same pictures in their multi-picture buffers. This is
achieved by memory-management control-operations (MMCO) messages that are
sent in the bit-stream, where the encoder signals the index of the reference picture
for each 16 X 16, 16 X 8, 8 X 16, and 8 X 8 block. Sub-blocks smaller than 8 X 8 use
the same reference picture.

Figure 8.13 Sub-blocks for motion compensation.

Figure 8.14 Illustration of multiple-picture motion-compensated P-prediction.


Multi-Hypothesis MCP, Weighted MCP, and B-Slices 


With the introduction of multi-picture MCP and allowing B-slices to be used as 
reference for other slices, the main difference between P-slices and B-slices in H.264/ 
AVC becomes that B-slices allow weighted bi-prediction. In previous standards, bi- 
directional weighted prediction was performed by a simple (1/2, 1/2) averaging fil- 
ter. In H.264/AVC, the encoder can specify weights and offsets to be used in each 
P- and B-macroblock of a slice. The encoder is allowed to specify different weights 
and offsets within the same slice. This can be especially effective in encoding “cross 
fades” as B-slices allow weighted blending between pictures from two scenes (shots). 
Offsets allow multi-hypothesis MCP, i.e., linear superposition of MCPs, such as 
in overlapped block-motion compensation. Indeed, combining multi-picture MCP, 
multi-hypothesis MCP, and weighted MCP offers a very powerful unified generaliza- 
tion of all known MCP concepts. 
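The effect of explicit weights on a cross fade can be seen in a simplified floating-point sketch. The standard specifies integer weights with a fixed-point denominator; the version below ignores that detail, assumes 8-bit samples, and uses an illustrative function name.

```python
import numpy as np


def weighted_biprediction(pred0, pred1, w0=0.5, w1=0.5, offset=0.0):
    """Blend two motion-compensated predictions (floating-point sketch).

    With the default weights this reduces to the (1/2, 1/2) averaging
    used in earlier standards.
    """
    p = (w0 * np.asarray(pred0, dtype=float)
         + w1 * np.asarray(pred1, dtype=float) + offset)
    return np.clip(np.rint(p), 0, 255).astype(np.uint8)


scene_a = np.full((4, 4), 100)   # MCP from the outgoing shot
scene_b = np.full((4, 4), 200)   # MCP from the incoming shot
# A picture three quarters of the way through a cross fade
print(weighted_biprediction(scene_a, scene_b, w0=0.25, w1=0.75)[0, 0])
```

For a picture three quarters of the way through the fade, the weights (1/4, 3/4) predict the blended intensity directly, whereas plain averaging would leave a larger residual to be coded.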

B-slices also allow a “direct mode” where MVs for a macroblock are not explic- 
itly sent. The decoder derives MVs by scaling MVs of a co-located macroblock in 
another picture, which uses the same reference picture, according to time differences 
between the three pictures. 


8.3.4 Transform 


H.264/AVC uses 4 X 4 transform blocks and a separable integer transform that
closely approximates the DCT instead of the DCT itself. The coefficients of the 4 X 4 DCT vs.
the integer transform are shown in Figure 8.15. FRExt also supports an 8 X 8 integer
transform. Since the inverse transform is an integer transform, rounding errors do
not occur.

Figure 8.15 Comparison of 4 X 4 transforms: (a) DCT and (b) integer transform in H.264/AVC.

We note that since H.264/AVC has improved prediction modes, the intra- and 
inter-prediction error (residual) has less spatial correlation; hence, the 4 X 4 trans- 
form is as efficient as the usual larger 8 X 8 transform in removing correlations. The 
smaller transform has the following benefits: i) it visually results in less mosquito 
noise and ringing artifacts around edges, ii) it can be implemented by using only 
adds and shifts and, hence, avoids the encoder-decoder mismatch problem, and 
iii) it requires only a 16-bit word length for all operations including scaling.
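The forward core transform can be written as two small integer matrix products. The matrix below is the one commonly given for the H.264/AVC 4 X 4 transform and is assumed here to match the integer transform discussed in this section; the normalization that makes the DCT orthonormal is folded into the quantizer in the standard and is not shown. Names are illustrative.

```python
import numpy as np

# Assumed 4x4 forward core transform matrix of H.264/AVC
CF = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]], dtype=np.int64)


def forward_core_transform(block):
    """Separable integer transform of a 4x4 residual block: CF * X * CF^T."""
    x = np.asarray(block, dtype=np.int64)
    return CF @ x @ CF.T


flat = np.full((4, 4), 10)
print(forward_core_transform(flat))   # only the DC coefficient is nonzero
print(CF @ CF.T)                      # rows are orthogonal but not unit norm
```

Every entry of the matrix is 1 or 2 in magnitude, so each multiplication is a copy or a one-bit shift, which is why adds and shifts suffice and why encoder and decoder produce identical results.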

The DC coefficients of the luma intra-16 X 16 mode and all chroma modes
undergo a second transformation. This Hadamard transformation is 4 X 4 for the
intra-16 X 16 mode and 2 X 2 for all others. The second transformation aims to exploit
the remaining redundancy between DC coefficients of the 4 X 4 blocks, which
proves useful in relatively flat image areas.


8.3.5 Other Tools and Improvements


Quantization


A uniform quantizer whose step-size is controlled by a quantization parameter (QP)
that can take on 52 different values is used. The quantization step-size grows
exponentially with QP: an increase by one in QP corresponds to an approximately
12% increase in the step-size, so the step-size doubles for every increase of six in QP.
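The relation between QP and the step-size can be tabulated with a short function. The six base values below are those commonly quoted for H.264/AVC and are not given in this section; the function name is illustrative.

```python
# Step-sizes for QP = 0..5 (assumed, as commonly tabulated for H.264/AVC);
# the step-size doubles for every increase of 6 in QP.
BASE_QSTEP = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def qstep(qp):
    """Quantization step-size for QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be between 0 and 51")
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))


for qp in (0, 6, 12, 28, 51):
    print(qp, qstep(qp))
print(2 ** (1 / 6))   # average growth factor per QP step, about 1.12
```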


Entropy Coding 


There are two entropy coders in H.264/AVC: context-adaptive variable-length coding
(CAVLC) and context-adaptive binary arithmetic coding (CABAC). CABAC
has higher complexity and is supported in Main and High profiles only.

In CAVLC, the quantized transform coefficients and motion vectors are coded
using VLC tables that are context conditional, i.e., switched according to the values
of the previous syntax elements. Different context models are used for motion vectors
and transform coefficients. Instead of designing a different VLC table for each
context, only the mapping from a single universal code table, referred to as the
Exp-Golomb code, to each context is performed.
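The structure of the Exp-Golomb code is simple enough to generate on the fly, which is why no stored table is needed. The sketch below produces the codeword for a non-negative code number and the usual signed mapping (positive values to odd code numbers, non-positive values to even ones); the function names are illustrative.

```python
def exp_golomb_unsigned(code_num):
    """Exp-Golomb codeword (as a bit string) for a non-negative integer."""
    if code_num < 0:
        raise ValueError("code number must be non-negative")
    bits = bin(code_num + 1)[2:]            # binary form of code_num + 1
    return "0" * (len(bits) - 1) + bits     # prefix of leading zeros


def exp_golomb_signed(value):
    """Signed mapping: 0, 1, -1, 2, -2, ... -> code numbers 0, 1, 2, 3, 4, ..."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return exp_golomb_unsigned(code_num)


for n in range(5):
    print(n, exp_golomb_unsigned(n))
```

Shorter codewords go to smaller code numbers, so adapting to a context amounts to choosing how syntax-element values are mapped to code numbers.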

In CABAC, after context estimation, symbols are mapped into a sequence 
of binary decisions, where each decision is encoded by binary arithmetic coding. 
CABAC not only uses context-conditional probability estimates, but also encodes 
each symbol with a non-integer number of bits, as we have learned when studying 
arithmetic coding. Compared to CAVLC, CABAC improves coding efficiency by 
10% to 12% at the expense of increased computational complexity. 


In-Loop De-Blocking Filter 


Various de-blocking filters have been developed for post-processing of decompressed 
frames to reduce blocking artifacts resulting from processing each block indepen- 
dently in JPEG and MPEG-1/MPEG-2. In H.264/AVC, the de-blocking filter is used 
in the encoding-decoding loop; hence, it is a normative part of the standard. The filter 
strength can be selected a priori or determined by the encoder according to coding
modes of adjacent blocks, quantization step-size, and steepness of the luminance gra- 
dient between blocks. The filter operates on the edges of each 4 X 4 or 8 X 8 transform 
block. It can modify up to three pixels on either side of a given block edge depending 
on the filter-strength value. The filter can also be turned off by the encoder. 


Error Resilience Tools 


Partitioning pictures into slices helps localize the adverse effects of network packet 
losses since the start of each slice is a resynchronization point at the decoder. In 
addition, H.264/AVC provides special tools to enhance error resilience of the bit- 
stream, including data partitioning, flexible macroblock ordering (FMO), arbitrary 
slice ordering (ASO), and redundant slices. Data partitioning refers to separating 
more important data such as MB types and MVs from less important data such as 
transform coefficient values and sending them in separate packets. FMO refers to 
non-sequential mapping of MBs to different slices, such as by a checkerboard pat- 
tern. ASO refers to non-sequential ordering of slices of a picture in the bit-stream. 
Redundant slices allow sending duplicative coded representations of some or all parts 
of a picture. 
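The checkerboard pattern mentioned for FMO can be sketched as a map from MB position to slice-group index. This is an illustration with two slice groups only; the function name is hypothetical and the standard's map-type signaling is not modeled.

```python
def checkerboard_slice_groups(mb_cols, mb_rows):
    """Assign each MB to slice group 0 or 1 in a checkerboard pattern."""
    return [[(row + col) % 2 for col in range(mb_cols)]
            for row in range(mb_rows)]


groups = checkerboard_slice_groups(6, 4)
for row in groups:
    print(row)
```

If the packet carrying one slice group is lost, every missing MB still has its four nearest neighbors in the other group, which helps error concealment by spatial interpolation.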

The coded data is organized into network-abstraction layer (NAL) units (also called
packets) containing an integer number of bytes. The first byte of each NAL unit is
a header, and the remaining bytes are the payload data of the type indicated by the
header. When streaming over the Internet, the NAL units are encapsulated by underlying
transport protocol packets.
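The one-byte NAL unit header can be unpacked with a few bit operations. The field layout used below (1-bit forbidden_zero_bit, 2-bit nal_ref_idc, 5-bit nal_unit_type) is the H.264/AVC header layout and is not spelled out in this section; the function name is illustrative.

```python
def parse_nal_header(first_byte):
    """Split the first byte of an H.264/AVC NAL unit into its three fields."""
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x1,  # must be 0
        "nal_ref_idc": (first_byte >> 5) & 0x3,         # 0 = not used as reference
        "nal_unit_type": first_byte & 0x1F,             # type of the payload
    }


print(parse_nal_header(0x65))   # header byte typical of an IDR slice
```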

An open-source implementation of the AVC encoder, known as the x264 library,
was released under the GNU GPL license and is available at:
http://www.videolan.org/developers/x264.html.


8.4 High-Efficiency Video-Coding (HEVC) Standard 


The high-efficiency video-coding standard (HEVC) is the most recent video-com- 
pression standard jointly developed by ITU-T VCEG and ISO/IEC MPEG and its 
first version was published in 2013 [Sul 12]. The ITU term for HEVC is H.265, 
while the MPEG name is MPEG-H Part 2. It is designed to provide an average of 
50% improvement in bit-rate for ultra-high-definition video compared to H.264/ 
AVC at the same visual quality by aggregating a number of small improvements. 
The basic HEVC design is still based on the classical MC transform coding, and 
also includes a video-coding layer (VCL) that refers to actual video, and a network- 
abstraction layer (NAL) that refers to transport interface aspects. The main technical 
novelties are replacing the concept of MB with coding-tree units (CTU) and intro- 
ducing tools for parallel video encoding/decoding. Scalable coding and 3D-video 
extensions of the standard are in progress. 


8.4.1 Video-Input Format and Data Structure 


HEVC assumes the input is progressive video since most cameras and displays are 
now progressive, and interlaced video is becoming less common. That is, HEVC 
includes no explicit tools for efficient encoding of interlaced video in order not to 
burden decoders with additional complexity, i.e., no more macroblock-adaptive 
frame-field (MBAFF) coding. However, metadata syntax is provided to signal an 
interlaced-video input where each coded picture can be a separate field or a compos- 
ite interlaced frame. 

HEVC high-level syntax includes new features for random access and bit-stream 
splicing. In H.264/AVC, a conforming bit-stream must start with an instantaneous 
decoder refresh (IDR) picture that defines a closed GOP. HEVC introduces a dis- 
tinct NAL unit type to signal a clean random-access (CRA) picture, which enables 
an open GOP. That is, an HEVC-conforming bit-stream may start with an IDR 
or a CRA picture. In a closed GOP, all pictures are decodable without referencing 
pictures from other GOPs. In an open GOP, some pictures can depend on pictures 
preceding the CRA picture. Pictures preceding a CRA picture in display order, but
appearing after the CRA picture in decoding order, are called leading pictures. If a
leading picture references a picture preceding the CRA picture, it should be skipped
in a decoding process that starts from the CRA picture, and it is called a
random-access skipped leading (RASL) picture. If a leading picture does not contain any
reference to pictures preceding the CRA picture, it is called a random-access decodable
leading (RADL) picture. Thus, CRA pictures provide random-access points (RAP)
without flushing the decoded picture buffer (DPB); i.e., without breaking temporal 
dependencies. Open GOPs are sometimes preferred because they generally provide 
better compression efficiency compared with closed GOPs. A broken-link access 
(BLA) picture is a special CRA picture used to signal a bit-stream splicing point. To 
summarize, in HEVC, random access is provided at IDR, CRA, and BLA pictures. 
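
The leading-picture rules above can be sketched as a small classifier. This is a simplified illustration, not decoder logic: the `Picture` record and its fields are hypothetical, references are given as decoding-order indices, and only direct references are checked (a leading picture that depends on a RASL picture would also have to be skipped).

```python
from dataclasses import dataclass, field


@dataclass
class Picture:
    poc: int                 # display (output) order
    decode_idx: int          # position in decoding order
    ref_decode_idxs: list = field(default_factory=list)  # pictures it references


def classify_after_cra(pic: Picture, cra: Picture) -> str:
    """Classify a picture that follows the CRA picture in decoding order."""
    if pic.decode_idx <= cra.decode_idx:
        raise ValueError("picture must follow the CRA picture in decoding order")
    if pic.poc > cra.poc:
        return "trailing"    # displayed after the CRA picture
    # Leading picture: decoded after, but displayed before, the CRA picture.
    if any(r < cra.decode_idx for r in pic.ref_decode_idxs):
        return "RASL"        # needs data unavailable when decoding starts at the CRA
    return "RADL"


if __name__ == "__main__":
    cra = Picture(poc=8, decode_idx=4)
    print(classify_after_cra(Picture(6, 5, [0, 4]), cra))
    print(classify_after_cra(Picture(7, 6, [4]), cra))
```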


8.4.2 Coding-Tree Units 


In HEVC, the concept of macroblock, which was the basic coding unit in the previous
standards, was replaced by coding-tree units (CTU), such that each slice of a picture is
partitioned into CTUs consisting of a coding-tree block (CTB) of luma pixels and two
coding-tree blocks of corresponding chroma pixels. CTUs can be 64 × 64, 32 × 32,
or 16 × 16. The width and height of a CTU are signaled in a sequence parameter
set, so that all CTUs in a video sequence have the same size. With the introduction
of ultra-high-definition video formats, we see that larger block sizes for MCP and/or
transform provide higher compression efficiency. In Class A test sequences with
resolution 2560 × 1600, it was found that the bit rate increases by 5.7% when forced to use
32 × 32 CTU and increases by 28.2% when forced to use 16 × 16 CTU compared to
64 × 64 CTU [Ohm 12]. Large CTU sizes also reduce the decoding time.
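
To give a feel for the numbers, the sketch below counts how many CTUs cover a picture for each allowed CTU size. It assumes that pictures whose dimensions are not multiples of the CTU size are covered by partial CTUs at the right and bottom borders, hence the rounding up; the function name is illustrative.

```python
import math


def ctu_grid(width: int, height: int, ctu_size: int) -> tuple:
    """Return (columns, rows, total) of CTUs needed to cover a picture."""
    if ctu_size not in (16, 32, 64):
        raise ValueError("CTU size must be 16, 32, or 64")
    cols = math.ceil(width / ctu_size)   # partial CTUs at the right border count
    rows = math.ceil(height / ctu_size)  # partial CTUs at the bottom border count
    return cols, rows, cols * rows


if __name__ == "__main__":
    for size in (64, 32, 16):
        print(size, ctu_grid(2560, 1600, size))
```

For a 2560 × 1600 picture, halving the CTU size quadruples the number of CTUs, which is one reason larger CTUs reduce per-block overhead.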

Each CTB can be split differently into square coding blocks (CB) of flexible size. HEVC
supports CB sizes from the same size as the CTB to as small as 8 × 8. The partitioning
of each CTB into CBs is conveyed using hierarchical quad-tree syntax. Partitioning
of a 64 × 64 CTB into CBs and its quad-tree representation are illustrated
in Figure 8.16, where some CBs are 32 × 32 and others are 16 × 16 and 8 × 8. A
coding unit (CU) consists of a luma (Y) CB and the associated Cb and Cr CBs with their
syntax elements. The prediction-type decisions are made and signaled at the CU level.
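
The quad-tree partitioning of a CTB can be sketched as a short recursion. The split decision is passed in as a callback (`should_split`, a hypothetical stand-in for the encoder's rate-distortion decision or the decoder's parsed split flags); the sketch only shows how the hierarchy produces square CBs down to a minimum size.

```python
def split_ctb(x, y, size, should_split, min_cb=8):
    """Return the leaf coding blocks (x, y, size) of a CTB in z-scan order."""
    if size > min_cb and should_split(x, y, size):
        half = size // 2
        leaves = []
        # Visit the four quadrants: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves += split_ctb(x + dx, y + dy, half, should_split, min_cb)
        return leaves
    return [(x, y, size)]  # leaf of the quad-tree: one coding block


if __name__ == "__main__":
    # Toy rule: keep splitting only the block anchored at the CTB origin.
    blocks = split_ctb(0, 0, 64, lambda x, y, size: (x, y) == (0, 0))
    for block in blocks:
        print(block)
```

With this toy rule, a 64 × 64 CTB yields three 32 × 32, three 16 × 16, and four 8 × 8 CBs, a mix of sizes similar in spirit to the example of Figure 8.16.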

Figure 8.16 Partitioning a coding-tree block (CTB) into coding blocks (CB): (a) an example
partition and (b) associated quad-tree representation.


Figure 8.17 Partitioning of coding blocks (CB) into prediction blocks.


Each CB can be further split into prediction blocks (PB) depending on their
temporal and/or spatial predictability. In intra-prediction, the PB size is set the same
as the CB size except for the smallest CB size in the bit-stream, which can be further
split into four quadrants. As a result, different intra-prediction modes can be selected
for PBs as small as 4 × 4. In inter-prediction, luma and chroma CBs can be further
split into two or four PBs. Splitting into four is allowed only with the smallest CB
size. Partitioning of a CB into PBs is illustrated in Figure 8.17, where the asymmetric
splits with a ratio of 1/4 and 3/4, depicted in the lower row, are only allowed for CBs
16 × 16 or larger. Each PB can be assigned one or two MVs. To avoid a large number
of MVs, 4 × 4 luma PBs are not allowed in inter-prediction, and 4 × 8 and 8 × 4
luma PBs can only be used in unidirectional inter-prediction.
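
The inter-prediction partitioning rules can be summarized in a small lookup. The mode labels (2Nx2N, 2NxN, Nx2N, NxN, 2NxnU, 2NxnD, nLx2N, nRx2N) are the names commonly used for HEVC partition modes; the function itself and its `min_cb` parameter are illustrative assumptions, and only the constraints stated in the text are enforced.

```python
def inter_pb_sizes(cb: int, mode: str, min_cb: int = 8) -> list:
    """Return the (width, height) of each luma PB for an inter-predicted CB."""
    half, quarter = cb // 2, cb // 4
    symmetric = {
        "2Nx2N": [(cb, cb)],
        "2NxN": [(cb, half)] * 2,
        "Nx2N": [(half, cb)] * 2,
    }
    if mode in symmetric:
        return symmetric[mode]
    if mode == "NxN":
        # Four-way split only at the smallest CB size, and never down to 4 x 4.
        if cb != min_cb or half < 8:
            raise ValueError("NxN inter split not allowed for this CB size")
        return [(half, half)] * 4
    asymmetric = {
        "2NxnU": [(cb, quarter), (cb, cb - quarter)],
        "2NxnD": [(cb, cb - quarter), (cb, quarter)],
        "nLx2N": [(quarter, cb), (cb - quarter, cb)],
        "nRx2N": [(cb - quarter, cb), (quarter, cb)],
    }
    if mode in asymmetric:
        if cb < 16:
            raise ValueError("asymmetric splits require a CB of 16 x 16 or larger")
        return asymmetric[mode]
    raise ValueError(f"unknown partition mode: {mode}")


if __name__ == "__main__":
    print(inter_pb_sizes(32, "2NxnU"))
```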

Independent of partitioning into PBs, each CB can also be split into transform 
blocks (TB) whose boundaries need not be aligned with PB boundaries. That is, it is 
possible to perform a single transform across residuals from multiple PBs for inter- 
prediction CUs. Only square TBs can be specified and can be as small as 4 X 4. The 
TB partitioning information is signaled by a residual quad-tree. 


8.4.3 Tools for Parallel Encoding/Decoding 


HEVC supports special tools, such as tiles and wavefront parallel processing, that are 
designed to enable decoding a single picture with multiple threads. 


Tiles 


Tiles are independently decodable rectangular regions of an image, similar to
those defined in the JPEG2000 standard. Tiles should be at least 256 × 64 luma
pixels. They enable parallelism at the subpicture level with no need for
synchronization between threads. They also facilitate region-of-interest (ROI) decoding.
Multiple tiles may be contained within a single slice sharing a common header, or
alternatively a tile may contain multiple slices, including dependent slice segments
[Mis 13].


Figure 8.18 Tools for parallel encoding/decoding: (a) tiles, slices, and slice segments and
(b) wavefront parallel processing.

Dependent slice segments are defined to adjust the granularity of packetization
to assist low-delay encoding. With dependent slice segments, data associated with 
particular tiles or wavefront entry points can be carried in separate NAL units (pack- 
ets) with lower latency than if they are all packetized in a single slice. However, they 
are not independently decodable, since they do not contain slice headers. 

The relationship between tiles, slices, slice segments, and CTUs is illustrated 
in Figure 8.18(a), where a picture consisting of 8 × 8 CTUs is divided into two
tiles denoted by the vertical solid line. Note that tiles are always aligned with CTU 
boundaries. The first tile contains a single slice, which is divided into two slice seg- 
ments, whose boundaries are denoted by dotted lines. The first slice segment, marked 
by shaded CTUs, is an independent slice segment and the other is configured as a 
dependent slice segment. The second tile is split into two slices, each with one inde- 
pendent and one dependent slice segment. 


Wavefront Parallel Processing 


Wavefront processing enables subslice-level parallelism by partitioning slices into
rows of CTUs. Each row can be processed by an independent thread, as depicted in
Figure 8.18(b), but there should be a two-CTU processing delay between the threads.
That is, processing of the second row can start only after the first two CTUs from
the first row (A and B) have been completed, i.e., C and D can be processed in
parallel, and processing of the third row can start after two CTUs from the second
row (D and E) have been completed, and so on. Wavefront parallelism cannot be
used together with tiles. It often provides better compression efficiency than tiles,
since intra-prediction and prediction of motion vectors cannot be performed across
tile boundaries.
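
The two-CTU delay can be made concrete with a toy schedule. The sketch assumes one thread per CTU row, unit processing time per CTU, and that each CTU waits for its left neighbor and for the CTU above and to the right (which is what produces the two-CTU lag); real decoders have variable per-CTU times, so this is only an idealized picture.

```python
def wavefront_start_times(rows: int, cols: int) -> list:
    """Earliest start step of every CTU under idealized wavefront processing."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ready = 0
            if c > 0:
                # Previous CTU in the same row must be finished.
                ready = max(ready, start[r][c - 1] + 1)
            if r > 0:
                # The CTU above and to the right must be finished (clamped at the border).
                above_right = min(c + 1, cols - 1)
                ready = max(ready, start[r - 1][above_right] + 1)
            start[r][c] = ready
    return start


if __name__ == "__main__":
    times = wavefront_start_times(8, 8)
    for row in times:
        print(row)
    print("total steps:", times[-1][-1] + 1)
```

Under these assumptions, the third CTU of the first row and the first CTU of the second row both start at step 2, matching the statement that C and D can be processed in parallel.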


8.4.4 Other Tools and Improvements 


HEVC features a number of small improvements that yield large coding gains when 
combined together. 


Intra-Prediction 


In intra-prediction all PBs are square with the same size as TBs, i.e., 32 × 32,
16 × 16, 8 × 8, or 4 × 4. Intra-prediction partition follows the TB quad-tree. The
intra-prediction in HEVC supports DC, planar, and angular prediction modes for 
all TB sizes and slice types. In the DC mode, all samples in a block are set equal to 
the mean value of boundary samples. Planar mode predicts all samples in a block by 
fitting a planar amplitude surface, where horizontal and vertical slopes are computed 
from the boundary samples. In the angular prediction mode, HEVC defines 33 pre- 
diction directions as opposed to 8 directions in H.264/AVC. The projected sample 
locations are computed with 1/32 sample accuracy using bi-linear interpolation of 
samples with the closest integer locations. For improved accuracy, reference-sample 
smoothing, boundary-value smoothing, and reference-sample substitution proce- 
dures may be employed in HEVC. 
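
As a minimal sketch of the DC mode described above, the function below fills an N × N block with the rounded mean of the N reconstructed samples above the block and the N samples to its left. The boundary-value smoothing and reference-sample substitution steps mentioned in the text are deliberately omitted, and the function name and argument layout are illustrative.

```python
def dc_predict(top: list, left: list) -> list:
    """Predict an N x N block as the rounded mean of its boundary samples."""
    n = len(top)
    if n == 0 or len(left) != n:
        raise ValueError("top and left must both hold N boundary samples")
    total = sum(top) + sum(left)
    dc = (total + n) // (2 * n)          # integer mean with rounding
    return [[dc] * n for _ in range(n)]  # every sample gets the same value


if __name__ == "__main__":
    block = dc_predict([100, 102, 104, 106], [98, 100, 102, 104])
    for row in block:
        print(row)
```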


Motion Compensation 


Inter-prediction mode allows symmetric and asymmetric CB partitions, as shown
in Figure 8.17. HEVC specifies quarter-sample accuracy for luma motion vectors,
similar to AVC, but uses separable 8-tap and 7-tap interpolation filters for half- and
quarter-sample luma positions, respectively (as opposed to a 6-tap filter for half-sample
positions and 2-tap bi-linear interpolation for quarter-sample positions in AVC).
Chroma uses 1/8-sample motion-vector accuracy and a 4-tap filter. All PBs need to be
extended on all sides to provide the filter with the required boundary samples. HEVC
supports weighted prediction for both uni- and bi-directional PBs, where the weights are
explicitly transmitted in the slice header; unlike AVC, there is no implicit weighted
prediction. Similar to AVC, HEVC has two reference lists, L0 and L1, that can hold
16 references each, but the maximum number of unique pictures is 8. The encoder
may choose to add the same picture to the list more than once to enable predicting
the same picture with different weights.


Motion-Vector Coding 


There are two MV prediction modes: merge and advanced-motion vector predic- 
tion (AMVP). The encoder decides between these two modes for each prediction 
unit (PU) and signals it in the bit-stream. The merge process is a generalization of 
the direct mode in AVC, except that it explicitly states the reference picture list and 
index, whereas in the direct mode they take implicit values. When an inter-predicted 
CB is not encoded in skip or merge modes, AMVP is employed to encode a delta 
MV. Both merge and AMVP build a list of candidate MVs and then select one of 
them using an index coded in the bit-stream. 


Transform and Quantization 


The core transform in HEVC is a separable 32 × 32 integer transform that
approximates the DCT. The coefficients of the 16 × 16, 8 × 8, and 4 × 4 transforms are
derived from this by sub-sampling. There is also an alternate 4 × 4 integer
transform that approximates the discrete-sine transform (DST), which is applied to luma
residuals in intra-predictive coding.

HEVC uses the same uniform reconstruction quantization (URQ) scheme con- 
trolled by a QP as in the AVC. The QP values range from 0 to 51. There is an 
approximate logarithmic mapping between QP values and URQ step-sizes, where 
increasing QP by 6 corresponds to doubling the quantization step-size. HEVC also 
supports quantization-scaling matrices. 
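
The logarithmic QP-to-step-size relation can be written as a one-line function. The sketch assumes the commonly quoted normalization in which QP = 4 corresponds to a step size of 1; the normalization is an assumption of this example, while the doubling every 6 QP values is the property stated in the text.

```python
def quant_step(qp: int) -> float:
    """Approximate quantization step size for a given QP (0..51)."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0 to 51")
    # Step size doubles for every increase of 6 in QP.
    return 2.0 ** ((qp - 4) / 6.0)


if __name__ == "__main__":
    for qp in (4, 10, 22, 28, 51):
        print(qp, round(quant_step(qp), 3))
```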


Entropy Coding 


HEVC employs only CABAC for entropy coding. The core algorithm is the same 
as that in AVC; however, improvements are made in context modeling, adaptive 
coefficient scan, and coefficient coding for better compression efficiency. There 
are about half as many context-state variables in HEVC compared with AVC, and 
the initialization is simpler. CABAC decoding is inherently a serial operation, and 
fast/multi-thread hardware implementations of CABAC are difficult. Careful use 
of the by-pass mode and dependencies between coded data have been successfully 
exploited for throughput maximization of hardware decoders and multithread 
implementations. 


In-Loop De-Blocking Filter


De-blocking in HEVC is performed on 8 × 8 blocks only, unlike AVC where it is
applied to 4 × 4 block edges. All vertical edges in the picture are de-blocked first,
followed by all horizontal edges. The filter is similar to that in AVC, but only bound- 
ary strengths 2, 1, and 0 are supported. The 8-pixel separation between de-blocking 
filters enables parallel implementation, since edges do not depend on each other. 
Hence, it is possible to perform vertical edge filtering with one thread for each 
8-pixel column of the picture. Chroma is de-blocked only when one of the PUs on 
either side of a particular edge is intra-coded. 

Profiles, tiers, and levels define conformance points for implementing the stan- 
dard. Tiers define limits on the maximum bit-rate and coded picture buffer (CPB). 
HEVC currently defines three profiles, Main, Main10, and Main Still Picture, and 
two tiers, Main and High. Work is on-going on future extensions of HEVC includ- 
ing coding of extended range formats (bit-depth, color sampling), scalable video 
coding, and stereo/3D-video coding. 

An open-source implementation of an HEVC encoder, known as
the x265 application library, is available under the GNU GPL 2 license
(http://x265.org/).


8.5 Scalable-Video Compression 


Scalable-video coding (SVC; also known as layered-video coding) refers to the gen- 
eral approach to video encoding where subsets of the bit-stream can be decoded to 
generate complete video sequences, whose spatial/temporal resolution and/or qual- 
ity vary according to the selected subsets. SVC also refers to the specific encoding 
methodology described in the Annex G extension of the H.264/AVC standard. The 
minimum decodable subset of the bit-stream is called the base layer. All other layers 
are enhancement layers that improve the resolution or quality of the base-layer video. 
In SVC, the base layer is encoded independent of enhancement layers. However, 
each enhancement layer is coded in reference to a previous lower level. This increases 
the compression efficiency of the scalable-video stream compared to simulcasting 
(which refers to coding each layer independently). 

Scalable video enables serving video to users with different displays (formats), 
bit-rate, and power requirements from a single stream, and allows decoders of differ- 
ent complexities to coexist. While low-performance decoders process smaller subsets 
of the bit-stream producing basic quality video, high-performance decoders may 
decode larger subsets to produce higher-quality video. Scalable video also enables 
easy network adaptation for robust transmission over heterogeneous networks. In
case of congestion or lower bandwidth connections, enhancement layers can be 
dropped, allowing the decoder to at least receive the base layer in its entirety. 
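
The layer-dropping idea can be sketched as follows. Because each enhancement layer depends on the layers below it, layers are kept in order starting from the base layer until the available bandwidth is exhausted. The function name and the example rates are made up for illustration.

```python
def select_layers(layer_rates_kbps: list, available_kbps: float) -> int:
    """Return how many layers (base layer first) fit in the available bandwidth."""
    total = 0.0
    kept = 0
    for rate in layer_rates_kbps:
        if total + rate > available_kbps:
            break            # this layer and all higher layers are dropped
        total += rate
        kept += 1
    return kept              # 0 means even the base layer does not fit


if __name__ == "__main__":
    rates = [500, 700, 1500]  # base layer, then two enhancement layers
    for bandwidth in (400, 1300, 5000):
        print(bandwidth, select_layers(rates, bandwidth))
```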

Temporal, spatial, and quality scalable video-coding tools were first introduced 
in the MPEG-2 specification, which allows for two or three layers of video. How- 
ever, MPEG-2 spatial and quality scalability features have resulted in a notice- 
able loss of compression efficiency while increasing encoder/decoder complexity. 
Wavelet-based approaches for SVC have also been studied but have not been found 
as efficient as the scalable extension of the H.264/AVC standard. Most recently, 
the HEVC standard was extended to include scalable coding features, called the 
SHVC, and was completed in July 2014. SHVC supports parallel encoding and 
decoding of ultra-high-definition videos [Ham 14], where the base layer can be an 
AVC bit-stream. This section only introduces the basic tools to provide temporal, 
spatial, and quality scalable bit-streams in the scalable extension of the H.264/ 
AVC standard, which successfully addresses compression efficiency and complexity 
problems. 


8.5.1 Temporal Scalability 


Temporal scalability refers to the ability to decode video at different frame rates. For 
example, in MPEG-2, since B-pictures cannot be used as reference, they can be 
dropped without affecting decodability of I- and P-pictures, providing a limited 
temporal scalability. In H.264/AVC, temporal scalability is provided with more
flexibility through hierarchical prediction structures, which required only adding syntax
to the H.264/AVC design to signal temporal layers.

A dyadic and a low-delay structure for hierarchical prediction are illustrated in
Figure 8.19(a) and (b), respectively. In the dyadic structure using only B-pictures
(also called hierarchical B-pictures; see Figure 8.11), all temporal layers $T_l$, $l > 0$,
where $T_0$ denotes the base layer, can be decoded independently by restricting the
reference picture lists, list0 and list1, for each picture of layer $l$ to temporally
preceding and succeeding pictures, respectively, with a temporal layer identifier less
than $l$. In Figure 8.19(a), pictures 0 and 1 (decoding order) belong to $T_0$ (base
layer), where picture 1 is unidirectionally predicted from 0 (depicted by arrow).
Picture 2, predicted from 0 and 1, belongs to $T_1$. Pictures 3 (predicted from 0 and
2) and 6 (predicted from 2 and 1) belong to $T_2$. Pictures 4, 5, 7, and 8, which are
predicted from their immediate neighbors, belong to $T_3$. It was found that
hierarchical B-pictures not only enable temporal scalability but also provide superior
compression efficiency compared to the traditional IBBPBBP... prediction
structure when the GOP size is 16 or 32 and the quantization parameter $QP_l$ of layer $l$ is
chosen according to $QP_l = QP_0 + 3 + l$, $l > 0$ [Sch 07]. It is recommended that the
“spatial direct” mode of inter-picture prediction be employed when using
hierarchical B-pictures with $l > 2$.


Figure 8.19 Hierarchical prediction structures: (a) dyadic B-pictures structure and
(b) prediction structure for low-delay encoding/decoding.

The hierarchical B-pictures introduce an encoding/decoding delay of one GOP, 
which may not be suitable for interactive (videophone) applications. Alternatively, 
we can use a uni-directional hierarchical prediction structure by using only list0 
in hierarchical B-pictures. This causal hierarchical prediction structure, shown in 
Figure 8.19(b), provides the same level of temporal scalability as the hierarchical 
B-pictures without introducing a structural delay, but at the expense of reduced 
compression efficiency. 
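
The dyadic layer assignment and the QP cascade for hierarchical B-pictures can be sketched as below. The code assumes a GOP size that is a power of two and indexes pictures by display position, so positions that are multiples of the GOP size form the base layer; the QP rule is the base-layer QP plus 3 plus the layer index for layers above the base layer. Function names are illustrative.

```python
def temporal_layer(display_idx: int, gop_size: int) -> int:
    """Temporal layer of a picture in a dyadic hierarchy (0 = base layer)."""
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    pos = display_idx % gop_size
    if pos == 0:
        return 0                          # GOP boundary pictures form the base layer
    layer = gop_size.bit_length() - 1     # log2(gop_size): the highest layer
    while pos % 2 == 0:                   # each factor of two moves one layer down
        pos //= 2
        layer -= 1
    return layer


def layer_qp(base_qp: int, layer: int) -> int:
    """QP cascade: base QP for layer 0, base QP + 3 + layer otherwise."""
    return base_qp if layer == 0 else base_qp + 3 + layer


if __name__ == "__main__":
    layers = [temporal_layer(i, 8) for i in range(9)]
    print(layers)
    print([layer_qp(26, l) for l in layers])
```

Dropping all pictures above a chosen layer halves the frame rate for each layer removed, which is the temporal scalability the hierarchy provides.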


8.5.2 Spatial Scalability 


Spatial (pixel-resolution) scalability provides the ability to decode video at two or 
more spatial resolutions without first decoding all the full-resolution frames. The 
base layer is a low spatial-resolution video. Enhancement layers contain successively 
higher-resolution video. 

Figure 8.20 Intra- and inter-layer prediction in spatial-scalable coding.


Spatial-scalable encoding employs a pyramid representation of each frame.
Lower-resolution layers are obtained by successively decimating the full-resolution
video. Enhancement layers consist of the difference between the current resolution
layer and interpolation of the decoded lower layer video to the size of the current
layer. SVC specifies an 11-tap decimation filter given by (1 0 −5 0 20 32 20 0 −5
0 1)/64 and a 6-tap interpolation filter given by (1 −5 20 20 −5 1)/32 for the case
where the resolution of successive layers doubles. The generalized spatial-scalability option
of SVC supports arbitrary resolution ratios as well as scalability for interlaced video.
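
A one-dimensional sketch of the dyadic decimation and interpolation steps is given below. The tap values are taken as symmetric filters whose taps sum to 64 and 32, respectively; the border handling (sample replication) and the sampling phase are assumptions of this example, not details taken from the standard.

```python
DECIMATION_TAPS = [1, 0, -5, 0, 20, 32, 20, 0, -5, 0, 1]  # normalized by 64
INTERPOLATION_TAPS = [1, -5, 20, 20, -5, 1]               # normalized by 32


def _sample(signal: list, i: int) -> float:
    """Read a sample, replicating the border values outside the signal."""
    return signal[min(max(i, 0), len(signal) - 1)]


def decimate_by_two(signal: list) -> list:
    """Low-pass filter with the 11-tap filter and keep every second sample."""
    half = len(DECIMATION_TAPS) // 2
    out = []
    for center in range(0, len(signal), 2):
        acc = sum(t * _sample(signal, center + k - half)
                  for k, t in enumerate(DECIMATION_TAPS))
        out.append(acc / 64)
    return out


def interpolate_by_two(signal: list) -> list:
    """Copy integer positions and fill half-sample positions with the 6-tap filter."""
    out = []
    for i in range(len(signal)):
        out.append(float(signal[i]))
        # The half-sample between i and i + 1 uses samples i - 2 .. i + 3.
        acc = sum(t * _sample(signal, i - 2 + k)
                  for k, t in enumerate(INTERPOLATION_TAPS))
        out.append(acc / 32)
    return out


if __name__ == "__main__":
    row = [10, 12, 14, 16, 18, 20, 22, 24]
    low = decimate_by_two(row)
    print(low)
    print(interpolate_by_two(low))
```

In a spatial-scalable encoder, the enhancement layer would then code the difference between the original row and the interpolated version of the decoded lower layer.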

In addition to the single-layer prediction modes in H.264/AVC, SVC specifies 
new inter-layer prediction schemes where encoding mode, motion vectors, and/or 
enhancement-layer pixels can be predicted from the decoded lower-resolution lay- 
ers to improve the coding efficiency (see [Sch 07] for details). Figure 8.20 illustrates 
inter-layer prediction where enhancement layer pictures can be predicted from inter- 
polated versions of decoded base-layer pictures in addition to hierarchical prediction 
within each layer. SVC specifies that the same coding order is used in all layers. 

In order to limit decoder complexity, SVC introduces mandatory constraints 
that enable decoding with a single motion-compensation loop. As a result, the com- 
plexity overhead introduced by SVC is small compared with those of scalable profiles 
of previous standards. Spatial-scalable coding may result in a 10% to 50% increase 
in bit-rate at the same quality, depending on the properties of the specific video and 
selected prediction structure, when compared to single-layer encoding. 


8.5.3 Quality (SNR) Scalability 


SNR scalability offers the ability to decode video with different quality levels, where 
all layers have the same spatial and temporal resolution. SNR scalability is a valu- 
able tool to achieve error-resilient video streaming over heterogeneous networks, 
where the base layer may be served using better quality of service parameters and 
error-correction capabilities, whereas enhancement layers can be served as best-effort 
service using rate adaptation. 

Coarse-grain SNR scalability refers to the case where the number of supported
bit-rate points is limited by the number of layers. It can be considered as a special 
case of spatial scalability where all layers have the same resolution/size without any 
decimation/interpolation filtering, and inter-layer residual prediction is performed 
in the transform domain. 

Medium-grain SNR scalability (MGS) provides packet-based scalability within 
a given range of bit-rates in a nearly continuous manner, which is well-suited for 
network-rate adaptation in video streaming over the Internet, by dropping some of 
the enhancement layer packets. The base layer is obtained by a coarse quantization 
of DCT coefficients. Enhancement layers contain DCT-refinement coefficients to 
progressively increase the quality of decoded video. When decoded video packets 
(NAL units) can be unpredictably dropped, controlling drift due to encoder-decoder 
mismatch becomes a key issue. It is possible to completely eliminate drift by dis- 
abling the prediction from the enhancement layer (allowing only prediction from 
the reconstructed base layer), but this approach (used in the MPEG-4 fine granular 
scalability method) results in a significant loss of compression efficiency. The other 
extreme is to always allow prediction from the highest-quality enhancement layer 
(used in the MPEG-2 SNR scalability method), which results in significant drift 
that can only be managed by frequent intra-frame transmissions. SVC allows encoders
to select a suitable tradeoff between compression efficiency and drift control by
introducing the notion of key pictures, where the base layer is also stored in the DPB
to resynchronize the encoder and decoder reconstruction processes, which limits drift
to the interval between two key pictures. A combination of hierarchical B-pictures and
key pictures (denoted by darker lines) for MGS coding with a GOP size of 4 is illustrated in
Figure 8.21, where base-layer pictures between two key pictures are predicted from
the highest quality (enhancement layer) pictures.


Figure 8.21 Hierarchical intra- and inter-layer prediction in MGS with key pictures.


Bit-stream extraction refers to obtaining video at a desired bit-rate (lower than the
maximum supported rate) by discarding some MGS-encoded NAL units. Clearly, the 
same desired bit-rate can be attained by discarding different NAL units, which would 
yield different quality videos. The simplest method is to randomly discard NAL units 
until the desired bit-rate is reached. A more sophisticated approach is to determine 
the NAL units to be discarded by using a rate-distortion analysis. SVC syntax sup- 
ports assigning a priority identifier to each NAL unit by an encoder. Then, NAL units 
with the lowest priority are discarded first, followed by the next lower priority, and 
so on. Prioritization of NAL units leads to generation of the so-called priority layers. 
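
Priority-based extraction can be sketched as follows. Each NAL unit is modeled as a hypothetical tuple `(bits, priority, is_base)`, where a larger `priority` number means a more important unit (this numbering convention is an assumption of the example); base-layer units are never discarded, and the remaining units are dropped from least to most important until the target is met.

```python
def extract_by_priority(nal_units: list, target_bits: int) -> list:
    """Discard the lowest-priority enhancement NAL units until the target is met."""
    total = sum(bits for bits, _, _ in nal_units)
    # Indices of discardable units, least important first.
    candidates = sorted(
        (i for i, (_, _, is_base) in enumerate(nal_units) if not is_base),
        key=lambda i: nal_units[i][1],
    )
    discarded = set()
    for i in candidates:
        if total <= target_bits:
            break
        discarded.add(i)
        total -= nal_units[i][0]
    # Keep the surviving units in their original (decoding) order.
    return [u for i, u in enumerate(nal_units) if i not in discarded]


if __name__ == "__main__":
    units = [(1000, 9, True), (400, 3, False), (300, 1, False), (200, 2, False)]
    print(extract_by_priority(units, 1500))
```

Replacing the sort key with a rate-distortion score would turn the same loop into the more sophisticated extraction strategy mentioned above.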

Since motion compensation is performed using a single loop, the complexity of 
a decoder supporting MGS is close to that of a single-layer H.264/AVC decoder. 
With proper encoder optimization, the bit-rate overhead of MGS encoding is 10% 
to 20% compared with single-layer H.264/AVC encoding at the same fidelity when 
the ratio of the highest supported bit-rate to the lowest is between 2 and 3. 


8.5.4 Hybrid Scalability


Hybrid scalability refers to a combination of spatial, temporal, and SNR scalability
provided in a single stream. The SVC bit-stream structure is organized in terms of
dependency layers, which represent different spatial/temporal resolutions. Quality- 
refinement layers are defined within each dependency layer to allow for hybrid scal- 
ability. When inter-layer prediction is employed, the dependency identifier and the 
quality identifier of the reference layer must be signaled. Switching between different 
dependency layers is only allowed at predefined frames, whereas switching between 
quality layers is possible at any access unit. 

SVC also supports region-of-interest (ROI) scalability, which can be imple- 
mented using slice groups. However, the shape of the ROI can only be represented 
as a collection of macroblocks. 


8.6 Stereo and Multi-View Video Compression 


With recent advances in display technologies and video compression, 3D video, 
which offers immersive entertainment and communication experiences, including 
free-view TV, has become feasible and highly popular [Tan 12]. 3D-video formats 
include stereoscopic (stereo-pair), multi-view (n-view), and n-view plus n-depth 
representations. An overview of 3D-video-compression approaches and standards
is provided in [Tan 14]. A summary of these approaches is depicted in Figure 8.22.


Figure 8.22 Overview of 3D-video compression.


Frame-compatible formats have been developed to compress down-sampled stereo 
video by using existing (legacy) compression and transmission standards. While this 
approach is simple, it results in a loss of spatial resolution. 

The multi-view video coding (MVC) extension of the H.264/AVC standard was 
developed for efficient compression of full-resolution stereo and multi-view video 
using inter-view disparity-compensated compression. However, the total bit-rate still 
increases linearly with the number of views, and more efficient representation and 
compression schemes are required especially when the number of views is large. 

The n-view plus n-depth format has proven to be efficient for compression of 
multi-view video with a large number of views (e.g., 45 or more). Standardization 
of this format as extensions of the H.264/AVC and HEVC standards is in progress. 
Besides autostereoscopic free-view 3D-video, multi-view video formats can also be 
used in free-view 2D video, which can be viewed from multiple angles interactively 
using a conventional 2D display, and in computational imaging (e.g., synthetic aper- 
ture photography). 


8.6.1 Frame-Compatible Stereo-Video Compression 


Frame-compatible formats, where the right and left stereo frames are down-sampled 
and packed together in a single frame, allow adding stereo-video services with only 
a software upgrade of existing equipment and infrastructure. Some of the common 
frame-compatible stereo-video formats are illustrated in Figure 8.23, where R and L 
denote the pixels of the right and left views, respectively. In frame-compatible for- 
mats, the spatial resolution of the right and left views is reduced horizontally and/or 
vertically. Alternatively, it is possible to multiplex right and left views temporally in 
a frame/field sequential format, where the frame/field rates of the right and left views 
may be reduced by a factor of two. 
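
Two of the common packings, side-by-side and top-and-bottom, can be sketched with plain lists of pixel rows. The sketch down-samples by simply dropping every second column or row; a practical system would low-pass filter first, and the choice of placing the left view in the left (or top) half is an assumption of this example.

```python
def pack_side_by_side(left: list, right: list) -> list:
    """Halve both views horizontally and place them next to each other."""
    if len(left) != len(right) or any(len(a) != len(b) for a, b in zip(left, right)):
        raise ValueError("left and right views must have the same dimensions")
    # Keep every second column of each view; the left view fills the left half.
    return [l_row[::2] + r_row[::2] for l_row, r_row in zip(left, right)]


def pack_top_bottom(left: list, right: list) -> list:
    """Halve both views vertically and stack the left view above the right view."""
    if len(left) != len(right):
        raise ValueError("left and right views must have the same number of rows")
    # Keep every second row of each view.
    return left[::2] + right[::2]


if __name__ == "__main__":
    left_view = [[1, 2, 3, 4], [5, 6, 7, 8]]
    right_view = [[9, 10, 11, 12], [13, 14, 15, 16]]
    print(pack_side_by_side(left_view, right_view))
    print(pack_top_bottom(left_view, right_view))
```

Either way the packed frame has the same size as a single original view, which is why it passes through existing 2D equipment at the cost of half the spatial resolution per view.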

Figure 8.23 Common frame-compatible stereo-video formats.


Only the display sub-system that follows decoding needs to be modified in the 
user set-top box to offer 3D-video services to customers with 3D-capable displays 
over the existing transmission infrastructure using one of these formats. In order to 
correctly parse the left and right views from a frame-compatible format view, the dis- 
play sub-system must know the exact packing format. An early version of the H.264/ 
AVC standard (2004) supported stereo-video information (SVI) SEI messages that 
could signal a row-based interleaving of right-left views vs. a field-sequential order- 
ing of views, as well as whether inter-view prediction is enabled or disabled. The 
MVC extensions of H.264/AVC and H.265/HEVC now include an extended set of 
SEI messages, called frame-packing arrangement (FPA) SEI, which can signal all the 
packing formats depicted in Figure 8.23. 


8.6.2 Stereo and Multi-View Video-Coding Extensions of 
the H.264/AVC Standard 


The basic approach of stereo and MVC is to exploit redundancies that exist between 
neighboring views at a given time instant in addition to temporal redundancy 
between frames of a given view. A recent stereo and MVC extension of the H.264/ 
AVC standard specifies syntax and a set of tools that achieve a significant reduction 
in bit-rate relative to independent coding of the views without sacrificing the recon- 
structed video quality. 

MVC has been an active research area since the early work on disparity-
compensated prediction by Lukacs in 1986 [Luk 86]. The H.262/MPEG-2 stan- 
dard was amended in 1996 to support MVC by reusing tools originally intended 
for temporal scalability. However, multi-view extension of MPEG-2 video was never 
adopted in the market mainly because: i) transition from analog to digital TV and 
HDTV was a big challenge at that time, ii) 3D-display technology and flat-panel 
hardware were lacking at that time, and iii) MPEG-2 stereo coding did not offer a 
compelling compression improvement due to limited coding tools. 

A key feature of the MVC extension of H.264/AVC is its high compression effi- 
ciency compared with independent coding of views without changing the low-level 
syntax and decoding process of H.264/AVC. This design constraint enables hard- 
ware implementation of MVC decoders with only simple changes to existing H.264/ 
MPEG-4 AVC decoding chipsets. The MVC bit-stream consists of a base view that 
is coded independently of other views and must conform to one of the H.264/AVC 
profiles, e.g., High profile, and supplementary views that may be dependent on the 
base view and each other. The base view is encapsulated in AVC video NAL units, 
so they can be decoded by legacy AVC decoders. Other views are encapsulated in 
an extension NAL unit type that is also used for SVC bit-streams. A flag is used to 
distinguish between SVC and MVC-NAL units, which can be decoded by MVC 
decoders and discarded by legacy decoders. The main technical novelties in MVC 
include introduction of anchor pictures for efficient inter-view prediction, as well as 
time-first coding and reference-picture management to achieve low-delay encoding/ 
decoding and optimal memory consumption at the decoder. 

MVC introduces a new picture type, called anchor picture, which is similar to 
IDR pictures in that temporal prediction is not allowed; however, inter-view predic- 
tion from other views within the same access unit is allowed. Any picture that follows 
the anchor picture in both decoding order and display order cannot use any picture 
that precedes the anchor picture in decoding order as a reference for interpicture 
prediction, and any picture that precedes the anchor picture in decoding order can- 
not follow it in display order. This provides a clean random-access point for a given 
view. In MVC, inter-view prediction is adaptive and the best predictor in terms of 
rate-distortion cost between temporal or inter-view references is chosen on a block 
basis. It has been observed that most of the coding gain that comes from inter-view 
prediction is realized at the anchor pictures [Vet 11]. Hence, turning inter-view pre- 
diction off at non-anchor frames saves memory and coding delay. 

In developing the MVC, two different decoding orders, view-first coding and 
time-first coding, have been considered. In view-first coding, pictures of each view 
within each group of pictures (GOP) are contiguous in their decoding order. Coded
pictures belonging to different views at the same time instant are interleaved with 
other pictures at other times, and thus cannot be in the same access unit. These dif- 
ferent access units, when stored in an ISO base media file, which requires samples to 
be ordered in their decoding order, have composition time offsets proportional to the 
GOP size multiplied by the number of views, which causes a significant initial buffering
delay. Hence, MVC uses time-first coding, where pictures at any time instant are con- 
tiguous in the bitstream. We can then define pictures at the same time but belonging 
to different views as one access unit, and an access unit contains NAL units continu- 
ous in decoding order [Che 09]. 
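
The difference between the two decoding orders can be illustrated with a minimal sketch. It is a simplification under stated assumptions: pictures are identified only by a hypothetical `(view, time)` pair, the prediction structure inside the GOP is ignored, and the function names are illustrative rather than part of any standard. The sketch counts how many pictures must be decoded before all views of the first time instant are available.

```python
def view_first_order(num_views, gop_size):
    """All pictures of one view in the GOP are contiguous (view-first)."""
    return [(view, t) for view in range(num_views) for t in range(gop_size)]


def time_first_order(num_views, gop_size):
    """All views of one time instant are contiguous, forming one access unit."""
    return [(view, t) for t in range(gop_size) for view in range(num_views)]


def pictures_until_instant_complete(order, num_views, t=0):
    """Count pictures decoded until every view of time instant t is available."""
    seen = set()
    for count, (view, time) in enumerate(order, start=1):
        if time == t:
            seen.add(view)
            if len(seen) == num_views:
                return count
    raise ValueError("time instant is not fully present in the order")


if __name__ == "__main__":
    V, G = 3, 8  # assumed example: 3 views, GOP of 8 pictures
    print(pictures_until_instant_complete(view_first_order(V, G), V))
    print(pictures_until_instant_complete(time_first_order(V, G), V))
```

With three views and a GOP of eight pictures, view-first ordering needs 17 decoded pictures before the first multi-view time instant is complete, whereas time-first ordering needs only 3, which is consistent with the buffering argument above.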

During the development of the MVC, a number of macroblock-level coding 
tools were also explored, including: 


• Adaptive reference filtering, which compensates for focus mismatches between
different views.

• View-synthesis prediction, which predicts a picture in the current view from
synthesized references generated from neighboring views to achieve additional
coding gains [Yea 09].

• Illumination compensation, which compensates for illumination differences as
part of the inter-view prediction process.

• Motion-skip mode, which infers motion vectors from inter-view references,
noting the correlation between motion vectors in different views.


While illumination compensation and motion-skip mode offer notable gains, 
they have not been adopted into the MVC standard because they require changes 
affecting macroblock level encoding and decoding processes causing implementation 
concerns [Vet 11]. 


Asymmetric Coding of Stereo Video 


According to the suppression theory of stereo human vision, the human-vision sys- 
tem can tolerate absence of high frequencies in one of the views; therefore, the left 
and right views of stereo video can be represented at unequal resolutions or bit- 
rates. Asymmetric coding refers to encoding the non-base view with lower quality 
than the base view, where the non-base view can be significantly blurred or more 
coarsely quantized (resulting in bit-rate reduction), or coded with a reduced spatial 
resolution. Substantial savings in bit-rate can be achieved by using asymmetric cod- 
ing without a perceptible impact on stereo-video quality. It has been shown that 
asymmetry by blurring provides finer control over achievable PSNR values [Say 11];
hence, it is superior to asymmetry by spatial-resolution reduction at high bit-rates
where it provides better rate-distortion (RD) performance. The MVC standard pro- 
vides the encoder with the freedom to select the fidelity for each view by performing 
pre-processing, such as blurring if desired; however, it uses the same sample-array 
resolution for the encoding of all views. Extensive subjective tests have been con- 
ducted to demonstrate the performance of asymmetric coding using short video 
clips; however, further study is needed to understand whether it would cause eye 
fatigue in longer duration videos. 


8.6.3 Multi-View Video Plus Depth Compression 


While coding of two fixed views, using a frame-compatible format or MVC as 
described above, provides basic 3D perception on stereoscopic displays, it is not 
suitable for disparity adjustment between views for adaptation to different displays 
and viewing conditions, which has been shown to provide a superior 3D experi- 
ence. Moreover, state-of-the-art auto-stereoscopic displays require displaying a large 
number of views (e.g., 45 or more), which would require significant bit-rates if all 
views were encoded by MVC. The multi-view video plus depth (MVD) format is a 
versatile 3D-video format that consists of a small number of views (texture) and 
their associated depth maps [Mul 11]. The depth value at each pixel is represented 
by monocular-depth images that have minimal high-frequency content and can be 
compressed efficiently. Using the MVD, stereo-disparity adjustment or additional 
intermediate view synthesis can be computed at the decoder using depth-image- 
based rendering (DIBR). We can classify recent work on MVD compression into 
two groups: i) those compatible with the H.264/AVC standard [Han 13] and ii)
those compatible with the H.265/HEVC standard [Mul 13]. The former group 
includes MVC+D and 3D-AVC configurations, which are explained below. 


Multi-View Coding Extensions for Inclusion of Depth Maps 


MVC+D, finalized in January 2013, is a backward-compatible extension of the
MVC standard for inclusion of depth maps. It specifies the encapsulation of MVC-coded
texture views and depth maps into a single bit-stream [Han 13]. Texture-only
views of MVC+D bit-streams can be decoded with an MVC decoder. The depth
maps, together with the high-level syntax signaling the necessary information for 
interpretation of the depth data, are represented by an independent second stream, 
which can be encoded by MVC as if it were a multi-view monochrome video. Inter- 
lace coding for texture or depth is supported for stereo views. 

MV-HEVC is an extension of the HEVC standard to provide efficient representation
of multi-view video and optional depth map information, e.g., for 3D
stereoscopic and autostereoscopic video applications. It was finalized in July 2014. 


3D-AVC 


3D-AVC is an advanced coding process that jointly encodes dependent (non-base) 
views and depth images to achieve better compression efficiency than MVC+D. 3D- 
AVC encodes a base view that is backward compatible with the H.264/AVC standard 
and independent of other views to support legacy monoscopic receivers. We note 
that a decoder supporting 3D-AVC can also decode MVC+D bit-streams. 

A 3D-AVC access unit is formed by all the video-texture and depth-map-view 
components that describe a 3D scene at a particular time instant. The data of a 
coded-view component is not interleaved by any other coded-view component, 
and the data for an access unit is not interleaved by any other access unit in the 
bit-stream/decoding order. The AVC/MVC-compatible video texture view com- 
ponents are coded before the respective depth-view components. Enhanced video- 
texture-view components are coded after the respective depth-view components. The 
video-texture and depth-view components of the same access units are coded in 
view-dependency order. Examples of coding order for an access unit include: 


1. T0, D0, T1, D1, ... (two AVC/MVC-compatible texture views)
2. T0, D0, D1, T1, ... (an AVC-compatible view, an enhanced texture view)


where T and D denote texture and depth map and the numerals indicate view 
numbers. 
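
The ordering rule can be sketched in a few lines. This is an illustration only: the function name and the boolean flag per view are hypothetical, and the views are assumed to be listed already in view-dependency order.

```python
def access_unit_order(enhanced_texture):
    """Return the component order of one access unit.

    enhanced_texture[v] is True if texture view v is coded with the enhanced
    tools, and False if it is AVC/MVC-compatible. Views are assumed to be
    given in view-dependency order.
    """
    order = []
    for view, enhanced in enumerate(enhanced_texture):
        texture, depth = f"T{view}", f"D{view}"
        # Compatible texture precedes its depth; enhanced texture follows it.
        order.extend([depth, texture] if enhanced else [texture, depth])
    return order


if __name__ == "__main__":
    print(access_unit_order([False, False]))  # two AVC/MVC-compatible views
    print(access_unit_order([False, True]))   # second view uses enhanced texture
```

The two calls reproduce the two example orders listed above: T0, D0, T1, D1 and T0, D0, D1, T1.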

The 3D-AVC specification includes several advanced coding tools that are briefly 
described in the following. These advanced coding tools do not support interlaced- 
video coding [Han 13]. 


1. Baseline-depth coding tools: 


a. Non-linear depth representation (NDR): This tool enables representing 
closer objects more accurately than distant ones. If NDR is turned on, the 
depth map is nonlinearly mapped through a forward lookup table at the 
preprocessing stage of the encoder and inversely mapped back to the origi- 
nal representation at the post-processing stage of the decoder. This tool is 
non-normative for MVC+D.

b. Reduced-resolution depth coding: Flexible depth-to-texture resolution ratio 
is allowed, e.g., depth resolution equal to 1/2 of luma resolution vertically 
and horizontally. The encoder can control the depth resolution relative to the
luma-texture resolution with horizontal and vertical scale factors as well as 
shift values in the 3D-AVC sequence parameter set. 

2. Enhanced-depth coding tools:


a. Depth-Range-Based Weighted Prediction (DRWP): This tool performs a 
non-linear compensation of the depth map. 

b. In-Loop Joint Inter-View Depth Filtering (VDF): Depth-map images of 
available views are filtered jointly. 

c. Motion prediction from texture to depth: Since a texture view and its asso- 
ciated depth-view component have similar objects, there is redundancy in 
their motion fields. This tool can be applied only for depth views for AVC/ 
MVC-compatible texture views. 

d. Depth Intra-Prediction includes depth intra-skip prediction and plane-seg- 
mentation-based intra-prediction (PSIP). 


3. Enhanced texture-coding tools applicable to dependent video views:


a. In-Loop Block-Based View Synthesis Prediction (VSP): A decoded texture- 
view component is projected to the viewing point of the current (de)coded 
dependent view using DIBR, given the camera parameters. The projected 
image is included in the reference picture list(s) and serves as a reference for 
MCP. 

b. Depth-Based Motion Vector Prediction (DMVP): This tool consists of 
direction-separated motion-vector prediction for the inter-mode and 
disparity-based skip and direct modes for further improving the accuracy 
of motion-vector predictors. 

c. Inter-view coding with adaptive luminance compensation (ALC): This 
tool suppresses local illumination changes between encoded macroblocks 
and predicted blocks that belong to an inter-view reference frame.

VSP and DMVP perform joint texture and depth coding, where samples of
depth data are utilized for efficient coding of texture. This introduces an
inter-component dependency.

“Slice header prediction” can be used for both depth and enhanced-texture
views.


Non-normative tools that can be used with MVC+D and 3D-AVC: 


a. Gradual view refresh (GVR) for texture and depth coding 
b. Rate-distortion optimization through view synthesis distortion (VSD) 
c. Post-processing dilation filtering (PDF) for depth map 


3D-HEVC


The 3D-HEVC encoder takes multiple views, associated depth maps, and corre- 
sponding camera parameters as input, although it can also operate without depth 
data. Various scenarios for the use of the 3D-HEVC codec are depicted in Figure
8.24. At the decoder, additional intermediate views can be rendered by a view 
synthesizer for display on a multi-view auto-stereoscopic display. View synthesis 
can be performed by a DIBR algorithm using the reconstructed views and depth 
data or by image-domain warping without depth data. The encoder can be config- 
ured such that a sub-bit-stream containing only two stereo views can be extracted 
and decoded using a stereo decoder. The view synthesizer can also render a stereo 
pair for a conventional stereo display, in case a stereo pair is not present in the 
bit-stream or to adjust the stereo views to the viewing conditions. The base view, 
which can be extracted and decoded by using an unmodified HEVC decoder, or 
one of the views decoded by a 3D-HEVC decoder or a synthesized intermediate
view at an arbitrary virtual camera position, can also be displayed on a conven- 
tional 2D display. 

A 3D-HEVC access unit includes all video pictures and depth maps that cor- 
respond to the same time instant (time-first coding). NAL units containing camera 
parameters may also be associated with an access unit. The reconstructed data of 
already-coded access units can be used for coding the current access unit. Random 
access is enabled by IDR access units. 

Figure 8.24 3D-HEVC use scenarios.

3D-HEVC offers the following new tools:


1. Video View Coding: New tools for coding dependent views include disparity- 
compensated prediction (DCP), advanced residual prediction (ARP), illumi- 
nation compensation (IC), view-synthesis prediction (VSP), and depth-based 
block partitioning (DBBP). The well-known concept of DCP, also used in 
MVC and 3D-AVC, was added as an alternative to MCP. ARP is a coding tool 
to exploit the residual correlation between views. IC uses a linear model to 
adapt luminance and chrominance of inter-view predicted blocks to the illumi- 
nation of the current view. VSP provides a predictor using depth information 
to reduce inter-view redundancy. DBBP partitions the collocated texture block 
based on a binary segmentation mask computed from the depth map. Each of 
the two partitions (e.g., foreground and background) is motion compensated 
and then merged using a depth-based segmentation mask. 

2. Depth-Map Coding: Depth-map coding employs the same intra-prediction, 
MCP, disparity-compensated prediction, and transform coding concepts as 
video coding. However, some tools have been modified for depth maps, other 
tools have been generally disabled, and additional tools have been added. Inter-view
motion prediction, residual prediction, view-synthesis prediction, and in-loop
filters are not used for depth coding. Instead, motion parameters are derived
based on coded data in the associated video pictures. New additions include 
new intra-coding modes, modified motion compensation and motion-vector 
coding, and motion-parameter inheritance. 

3. Motion Coding: Inter-view motion prediction derives motion parameters for 
a block in a current picture based on motion parameters in an already coded 
reference view and an estimate of the depth map for the current picture. The 
motion data is compressed into 1/4 resolution after encoding/decoding of 
each picture and then further compressed into 1/16 resolution after encoding/ 
decoding of all the pictures within the same access unit in order to reduce buffer 
size and memory bandwidth. 

4. Encoder Control: For mode decision and motion estimation, similar to encoder
control in other standards, a Lagrangian technique is used. For each candidate
mode or parameter set, the encoder evaluates the cost measure $D + \lambda R$,
where $D$ is the decoding distortion of a particular mode with particular
parameters for the considered block, $R$ is the number of bits required for
coding the block with these parameters, and $\lambda$ is the Lagrangian
multiplier, which is a function of the quantization parameter. The mode or
parameters with the smallest cost measure are selected. As the measure of
distortion, the sum of squared differences (SSD) or the sum of absolute
differences (SAD) between the original and the reconstructed sample values is
used. For the coding of depth maps, the same decision process is used; however,
the distortion measure is replaced with a measure that considers the
distortion in synthesized intermediate views using an encoder-side render model.

5. Decoder-Side View Synthesis: A fast 1D-view synthesis method based on DIBR 
for generating the required number of display views is provided. 
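
The Lagrangian mode decision described under Encoder Control can be sketched as follows. This is a toy illustration, not the reference encoder: the candidate modes, their reconstructions, and their bit counts are made-up values, and the function names are hypothetical. SSD is used as the distortion measure.

```python
def ssd(original, reconstructed):
    """Sum of squared differences between two equally long sample lists."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))


def choose_mode(original, candidates, lam):
    """Pick the candidate that minimizes the Lagrangian cost D + lam * R.

    candidates maps a mode name to (reconstructed_samples, bits).
    Returns (best_mode, best_cost).
    """
    best_mode, best_cost = None, float("inf")
    for mode, (reconstructed, bits) in candidates.items():
        cost = ssd(original, reconstructed) + lam * bits
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost


if __name__ == "__main__":
    block = [10, 12, 14, 16]
    # Hypothetical candidates: (reconstruction, bits needed to code the block)
    candidates = {
        "skip": ([10, 10, 10, 10], 1),
        "inter": ([10, 12, 14, 15], 20),
        "intra": ([10, 12, 14, 16], 60),
    }
    for lam in (0.01, 0.5, 5.0):
        print(lam, choose_mode(block, candidates, lam))
```

A small multiplier favors the low-distortion, high-rate candidate; a large multiplier favors the cheap one. In this toy example the selection moves from "intra" to "inter" to "skip" as the multiplier grows, mirroring the dependence of the multiplier on the quantization parameter.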


Exercises


8.1 The types of frames in a group of pictures (GOP) of MPEG-2 coded video and 
their display orders are shown below. Write the coding/decoding order for this 
GOP, and show the reference frames used in motion compensating each frame. 


IBBBRBBBEBBBER BB BB BB 


0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15


8.2 Explain intra-prediction in H.264/MPEG-4 AVC. Discuss different options 
briefly. 


8.3 Show the coding/decoding order for the following GOP in H.264/AVC video
coding using hierarchical B-pictures, and show the reference frames used for 
motion compensating each frame. 


IBBBBBBBBSBBBBSB B 


0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15


8.4 Compare motion JPEG2000 vs. all-intra-AVC video encoding conceptually.
Which one do you think would have higher compression efficiency? 


8.5 What tools can be used in an AVC encoder to avoid error propagation in trans- 
mission of coded video over error-prone networks? Explain in detail. 


8.6 What is rate control in video coding? When do we need rate control in video 
encoding? Describe a simple rate-control algorithm. 


8.7 Discuss constant-quality vs. constant-rate video encoding. How do you achieve 
constant-quality video encoding? Can you achieve constant-quality encoding 
while satisfying a desired average bit-rate? 


8.8 What is drift in scalable video coding? What is a key picture? Discuss approaches 
to avoid drift with their effect on coding efficiency. 


8.9 Discuss the overhead bit-rate requirements for temporal, spatial, and quality 
SVC. Discuss degradation introduced by dropping enhancement layers in each 
type of scalability for different types of video content. 


8.10 Discuss the pros and cons of SVC vs. adaptive stream (bit-rate) switching for 
video streaming over the Internet. 


8.11 Discuss the bit-rate requirements for coding multi-view video as a function 
of distance between cameras. How does the bit-rate requirement vary by the 
number of views? 


8.12 Discuss the bit-rate requirements for coding depth-map images compared to 
video views. Explain why it is easier or more difficult to encode depth maps.


Internet Resources 


VideoLAN x264 Free Software Library and Application
http://www.videolan.org/developers/x264.html 


x265 Open Source Project
http://x265.org/ 


WebM Project, VP8 and VP9 codecs
http://www.webmproject.org/


APPENDIX A 


Ill-Posed Problems
in Image and Video 
Processing 





Many image/video-processing problems are ill-posed; their analysis and solution
require a strong mathematical foundation and proper image/video/motion modeling.
This appendix aims to summarize fundamental modeling approaches in modern
image and video processing, and provide the foundation for sparse image representations,
to help put some well-known regularization methods for solving ill-posed
image- and video-processing problems on a common framework.


A.1 Image Representations 
A.1.1 Deterministic Framework — Function/Vector Spaces 


We can represent a discrete image $s(n_1, n_2)$ by an $N \times N$ matrix of pixels or by
an $N^2 \times 1$ vector $\mathbf{s} = [s_1 \; s_2 \; \cdots \; s_{N^2}]^T$ by mapping pixels $(n_1, n_2)$ into a 1D order
$j = 1, \ldots, N^2$, e.g., by lexicographic ordering. For video, we add a time dimension
with causal ordering in time. We assume the inner product and norm are defined
in this (Hilbert) space in the usual sense. Then, a specific image/video processing
problem can be formulated as optimization of the $\ell_2$-, $\ell_1$-, or $\ell_0$-norm of a suitable error
function subject to some constraints. These norms can also be used to induce some
probability distribution over an ensemble of images as will be discussed next.
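
As a small illustration of the vector representation and the three norms, the following sketch stacks an image into a vector and evaluates each norm. It assumes row-by-row (row-major) stacking as the lexicographic order, which is one common convention and may differ from the one intended in the text; the function names are illustrative.

```python
import math


def lexicographic(image):
    """Stack the rows of an N x N image (list of lists) into one vector."""
    return [pixel for row in image for pixel in row]


def l0_norm(x):
    """Number of non-zero entries."""
    return sum(1 for v in x if v != 0)


def l1_norm(x):
    """Sum of absolute values."""
    return sum(abs(v) for v in x)


def l2_norm(x):
    """Euclidean length."""
    return math.sqrt(sum(v * v for v in x))


if __name__ == "__main__":
    image = [[0, 3],
             [4, 0]]
    s = lexicographic(image)
    print(s, l0_norm(s), l1_norm(s), l2_norm(s))
```

For the 2 x 2 example the stacked vector is [0, 3, 4, 0], with $\ell_0$-, $\ell_1$-, and $\ell_2$-norms of 2, 7, and 5, respectively.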


A.1.2 Bayesian Framework — Random Fields 


Let’s now consider a family or ensemble of images, such as all natural images or all
smooth images, $S = \{\mathbf{s}_i, i = 0, 1, 2, \ldots\}$, such that $\mathbf{s}_i \in \mathbb{R}^{N^2}$. If we assume that pixels
$j$ have values $s_j$ in the range $[0, 1)$, images in an ensemble do not populate the
hyper-cube $[0, 1)^{N^2} \subset \mathbb{R}^{N^2}$ uniformly. We can model the distribution of images in a
particular ensemble by using a probability density function (pdf), called the a priori 
distribution P(s), which leads to the Bayesian framework for image processing, where 
we typically optimize the mean-square error (minimum mean-square error estima- 
tion), the conditional probability distribution (maximum-likelihood estimation), or 
a posteriori probability distribution (maximum a posteriori probability estimation) 
subject to some constraints. The extension of the Bayesian framework for video is 
straightforward. 


A.2 Overview of Image Models 


Many image- and video-processing problems are ill-posed in the sense that the solution 
is not unique, and/or it is highly sensitive to the presence of noise. Examples of such 
problems include image-gradient estimation; interpolation; inverse problems (including
de-noising, restoration, super-resolution, and inpainting) of the form $\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{v}$,
where $\mathbf{H}$ is an over-determined or under-determined matrix and $\mathbf{v}$ is observation noise;
and 2D/3D motion estimation/tracking. In general, it is not possible to find an
acceptable solution $\mathbf{s}$ to these problems without making some assumptions about the
nature of the solution, i.e., employing a priori models of the solution, which is called
regularization. A regularization method transforms an ill-posed problem to a 
well-posed problem, whose solution is an acceptable approximation to the solution 
of the ill-posed problem. For example, for image interpolation, we assume the image 
is bandlimited; for gradient estimation, we assume the image is smooth; for inverse 
problems, we assume the image is smooth and/or sparse in some transform domain; 
for motion estimation/tracking, we assume the motion is continuous over time, etc. 
The goodness of the solution depends on the suitability of the model used to solve the 
problem at hand. We can broadly classify image/video/motion models as: 


1. Smoothness Models: Perhaps the simplest model used in the image-processing 
and computer-vision community is the assumption that the solution varies 
slowly over space, i.e., it is smooth. In a deterministic framework, smoothness
can be modeled by minimizing $\|\mathbf{L}\mathbf{s}\|_2^2$, where $\|\cdot\|_2$ denotes the $\ell_2$-norm and $\mathbf{L}$ is a
Laplacian matrix (defines a linear space-invariant filter applied to the image $\mathbf{s}$)
subject to some observation constraint. In a probabilistic framework, this can be
expressed as a homogeneous random field with Gibbs a priori probability distribution,
given by $P(\mathbf{s}) \propto \exp\{-\lambda \|\mathbf{L}\mathbf{s}\|_2^2\}$. Then, deviation from spatial smoothness,
measured by the Laplacian operator, is used as a measure of likeliness of a
solution. Note that the smoother the image, the smaller $\|\mathbf{L}\mathbf{s}\|_2^2$; hence, smoother
images are more likely. This prior is well-known to be related to both Tikhonov
regularization and Wiener filtering and is extensively used in image processing.
We also note that Gibbs random fields are a particular instance of the more general
class of image models known as Markov random fields (see Appendix B).

2. Edge (Singularity)-Preserving Models: The smoothness prior defined in terms of
the Laplacian and $\ell_2$-norm is known to cause image over-smoothing when used
in image denoising, inverse problems, and motion estimation. The $\ell_2$-norm
strongly penalizes (i.e., makes highly unlikely) any large local differences such as
edges and motion discontinuities, which are key features for visual perception.
A possible remedy is replacement of the $\ell_2$-norm by a more robust measure
such as the $\ell_1$-norm that penalizes large values less, and the resulting pdf is
allowed to have heavy tails. Thus, a prior of the form $P(\mathbf{s}) \propto \exp\{-\lambda \|\mathbf{L}\mathbf{s}\|_1\}$ has
recently become popular. Alternatively, the “total-variation” prior also promotes
smoothness by replacing the Laplacian with gradient norms, thereby using first
derivatives rather than the second.

3. Sparse Models: Sparse and low-rank representations have recently become very
popular in image processing. The sparsity of an image representation can be
measured by the $\ell_0$-norm of the coefficients of a representation using a complete
or over-complete dictionary, which corresponds to the number of non-zero
components in this representation. More formally,
$\|\mathbf{T}\mathbf{s}\|_0 = \lim_{p \to 0} \|\mathbf{T}\mathbf{s}\|_p^p = \#\{i : (\mathbf{T}\mathbf{s})_i \neq 0\}$.
Optimization of the $\ell_0$-norm results in an NP-hard problem
that is intractable. Interestingly, a convex relaxation of such problems can be
formulated in terms of the $\ell_1$-norm, which also enforces sparsity. That is, we
can minimize $\|\mathbf{T}\mathbf{s}\|_1$ subject to some observation constraint (data term) to promote
a sparse solution. Note that sparsity of the orthogonal-wavelet transform
has been well exploited for image denoising using wavelet shrinkage. In the
Bayesian framework, the orthogonal wavelet transform of an image $\mathbf{s}$, given by
$\mathbf{T}\mathbf{s}$, can be used to define an image prior $P(\mathbf{s}) \propto \exp\{-\lambda \|\mathbf{T}\mathbf{s}\|_p^p\}$, with $p = 1$ to
promote sparsity.
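
The over-smoothing tendency of the quadratic smoothness penalty, compared with the $\ell_1$ penalty, can be seen on a toy 1D signal. The sketch below is an illustration under stated assumptions: a second-difference filter stands in for the Laplacian, only interior samples are evaluated, and the two test signals (a sharp step and a blurred step) are made up for the example.

```python
def second_difference(signal):
    """1D discrete Laplacian at interior samples: s[n-1] - 2 s[n] + s[n+1]."""
    return [signal[n - 1] - 2 * signal[n] + signal[n + 1]
            for n in range(1, len(signal) - 1)]


def l2_penalty(signal):
    """Squared l2-norm of the Laplacian response."""
    return sum(v * v for v in second_difference(signal))


def l1_penalty(signal):
    """l1-norm of the Laplacian response."""
    return sum(abs(v) for v in second_difference(signal))


if __name__ == "__main__":
    sharp = [0, 0, 0, 0, 1, 1, 1, 1]               # ideal edge
    blurred = [0, 0, 0.25, 0.5, 0.75, 1, 1, 1]     # same edge, smoothed
    print(l2_penalty(sharp) / l2_penalty(blurred))  # ratio under the l2 prior
    print(l1_penalty(sharp) / l1_penalty(blurred))  # ratio under the l1 prior
```

Here the sharp edge costs 16 times more than the blurred edge under the squared $\ell_2$ penalty but only 4 times more under the $\ell_1$ penalty, so the $\ell_1$ prior is far less biased against edges.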




A.3 Basics of Sparse-Image Modeling 


There has been much work on transform domain representation of images including 
2D-DFT, 2D-DCT, and 2D-wavelet transforms, where transform-domain represen- 
tations consist of a set of expansion coefficients with respect to some basis images. 
Image representations in the transform domain are usually sparse, meaning images 
can be represented by a small number of coefficients, which is the basic principle of 
image-compression methods. Recently, such transform-domain representations have 
been extended to redundant (overcomplete) image representations using a dictionary 
$\mathbf{D}$ of atomic images, where an $N^2 \times 1$ image $\mathbf{s}$ can be expressed as


$$\mathbf{s} = \mathbf{D}\boldsymbol{\alpha} + \mathbf{e}$$


The dictionary $\mathbf{D}$ is an $N^2 \times M$ matrix, where $M \ge N^2$ (it is redundant for
$M > N^2$) and each column represents an atomic image. The vector $\mathbf{e}$ denotes the
model error (mismatch) with finite energy, i.e., $\|\mathbf{e}\|_2^2 \le \delta^2$. The vector $\boldsymbol{\alpha}$ is called
a sparse representation (most of its entries are zero) based on this dictionary. The
sparsity of $\boldsymbol{\alpha}$ is measured by its $\ell_0$-norm $\|\boldsymbol{\alpha}\|_0$.

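How do we find the sparsest vector $\boldsymbol{\alpha}$ that models $\mathbf{s}$ as a linear combination of
columns from $\mathbf{D}$ with an error energy no larger than $\delta^2$? An exact optimal solution
of the problem


$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \|\boldsymbol{\alpha}\|_0 \quad \text{subject to} \quad \|\mathbf{s} - \mathbf{D}\boldsymbol{\alpha}\|_2^2 \le \delta^2 \qquad \text{(A.1)}$$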


is computationally intractable. There are two approaches to reach an acceptable solution:
i) employ greedy methods, such as matching pursuit, to find a sub-optimal
solution, and ii) solve a convex relaxation of the problem (A.1) in terms of
the $\ell_1$-norm, known as basis pursuit [Che 01],


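$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \|\boldsymbol{\alpha}\|_1 \quad \text{subject to} \quad \|\mathbf{s} - \mathbf{D}\boldsymbol{\alpha}\|_2^2 \le \delta^2 \qquad \text{(A.2)}$$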


which also enforces sparsity. 
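
As an illustration of the greedy route to problem (A.1), the following is a minimal sketch of plain matching pursuit, one common greedy algorithm chosen here for illustration and not necessarily the method the text has in mind. It assumes the dictionary atoms have unit norm; the tiny overcomplete dictionary and the function name are hypothetical.

```python
def matching_pursuit(signal, atoms, delta_sq, max_iters=50):
    """Greedy sparse approximation of `signal` over unit-norm `atoms`.

    Iterates until the residual energy is no larger than delta_sq
    (or max_iters is reached). Returns (coefficients, residual).
    """
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(max_iters):
        if sum(r * r for r in residual) <= delta_sq:
            break
        # Correlate every atom with the current residual.
        corr = [sum(a * r for a, r in zip(atom, residual)) for atom in atoms]
        # Select the atom that explains the most residual energy.
        k = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        coeffs[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
    return coeffs, residual


if __name__ == "__main__":
    # Overcomplete dictionary for 2-sample signals: M = 3 atoms > 2 samples.
    atoms = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
    signal = [1.2, 1.6]  # equals 2 x the third atom
    coeffs, residual = matching_pursuit(signal, atoms, delta_sq=1e-12)
    print(coeffs, residual)
```

With only the first two atoms, this signal would need two non-zero coefficients; the redundant third atom allows a representation with a single non-zero coefficient, which is the point of using an overcomplete dictionary.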

This basic model should often be tuned to specific images/applications by apply- 
ing it locally and selecting the best parameters that describe the local image char- 
acteristics and by choosing an appropriate dictionary, which can be universal or 
learned from a set of example images. 


A.4 Well-Posed Formulations of Ill-Posed Problems


Well-posed formulations of ill-posed problems can be obtained by regularization 
approaches. Most ill-posed image- and video-processing problems can be regular- 
ized by formulating them as constrained optimization (deterministic framework) 
or Bayesian estimation (stochastic framework) problems. As an example, let's take
inverse problems that can be formulated as: given $\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{v}$, where the degradation
matrix $\mathbf{H}$ is known and observation noise $\mathbf{v}$ has finite energy, i.e., $\|\mathbf{v}\|_2^2 = \sigma^2$,
estimate $\mathbf{s}$.


A.4.1 Constrained-Optimization Problem 


The classical constrained least-squares estimation formulation seeks the smoothest 
image that satisfies the observation equation constraint 


$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{L}\mathbf{s}\|_2^2 \quad \text{subject to} \quad \|\mathbf{y} - \mathbf{H}\mathbf{s}\|_2^2 = \sigma^2 \qquad \text{(A.3)}$$


The constraint expresses prior knowledge on the degradation and noise, which 
limits the solution space. We can employ the Lagrangian method to convert a con- 
strained optimization problem into an unconstrained optimization problem by 
defining the constraint (data term) as a penalty term 


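$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{L}\mathbf{s}\|_p^p + \lambda\left(\|\mathbf{y} - \mathbf{H}\mathbf{s}\|_2^2 - \sigma^2\right) \qquad \text{(A.4)}$$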


for p=1,2. This unconstrained-optimization problem can either be solved analyti- 
cally by differentiating the cost function with respect to unknowns and setting the 
resulting equations equal to zero, or numerically by one of the methods discussed in 
Appendix C. 
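
For $p = 2$, the analytic route can be sketched as follows. This is a standard least-squares derivation offered as an illustration, under the assumption that the matrix being inverted is nonsingular; $J(\mathbf{s})$ is a name introduced here for the cost function in (A.4).

```latex
\begin{align*}
J(\mathbf{s}) &= \|\mathbf{L}\mathbf{s}\|_2^2
  + \lambda\left(\|\mathbf{y} - \mathbf{H}\mathbf{s}\|_2^2 - \sigma^2\right) \\
\nabla_{\mathbf{s}} J &= 2\,\mathbf{L}^T\mathbf{L}\,\mathbf{s}
  - 2\lambda\,\mathbf{H}^T(\mathbf{y} - \mathbf{H}\mathbf{s}) = \mathbf{0} \\
\left(\mathbf{L}^T\mathbf{L} + \lambda\,\mathbf{H}^T\mathbf{H}\right)\mathbf{s}
  &= \lambda\,\mathbf{H}^T\mathbf{y} \\
\hat{\mathbf{s}} &= \left(\mathbf{H}^T\mathbf{H}
  + \tfrac{1}{\lambda}\,\mathbf{L}^T\mathbf{L}\right)^{-1}\mathbf{H}^T\mathbf{y}
\end{align*}
```

The multiplier $\lambda$ is then adjusted so that the resulting estimate satisfies the constraint $\|\mathbf{y} - \mathbf{H}\hat{\mathbf{s}}\|_2^2 = \sigma^2$. For $p = 1$ the cost is not differentiable everywhere, so a numerical method is needed.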

Alternatively, the problem has been formulated recently using sparse-image rep- 
resentations as 


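$$\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \|\boldsymbol{\alpha}\|_p^p \quad \text{subject to} \quad \|\mathbf{y} - \mathbf{H}\mathbf{D}\boldsymbol{\alpha}\|_2^2 = \sigma^2 \qquad \text{(A.5)}$$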


for p=0,1, emphasizing sparseness of the solution in some transform domain. The 
solution of inverse problems using sparse models has been discussed in [Fig 03, 
Dau 04, Com 05, Ela 10]. 


A.4.2 Bayesian-Estimation Problem


Bayesian-estimation problems are commonly stated as maximum-likelihood (ML), 
maximum a posteriori (MAP), or minimum mean-square error (MMSE) estimation 
problems. Most image- or motion-estimation problems can be posed as a MAP esti- 
mation problem defined as 


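$$\hat{\mathbf{s}} = \arg\max_{\mathbf{s}} P(\mathbf{s} \mid \mathbf{y}) = \arg\max_{\mathbf{s}} P(\mathbf{y} \mid \mathbf{s})\, P(\mathbf{s}) \qquad \text{(A.6)}$$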


which seeks the most probable image $\hat{\mathbf{s}}$ on the sphere $\|\mathbf{y} - \mathbf{H}\mathbf{s}\|_2^2 = \sigma^2$. The
pdf $P(\mathbf{s} \mid \mathbf{y})$ denotes the a posteriori probability distribution of the unknown $\mathbf{s}$ (image
or motion vectors) given the observations $\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{v}$. The pdf $P(\mathbf{y} \mid \mathbf{s})$ is called the
conditional probability distribution of the observations given $\mathbf{s}$; hence, it models the
distribution of the observation noise and enforces the data term, i.e., the constraint
$\|\mathbf{y} - \mathbf{H}\mathbf{s}\|_2^2 \le \sigma^2$. The pdf $P(\mathbf{s})$ is the a priori signal model that is discussed in Section
A.2. The optimization of (A.6) is often implemented by one of the numerical
optimization schemes discussed in Appendix C.
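
To see how the MAP formulation relates to the penalty form of Section A.4.1, consider one assumed special case: white Gaussian observation noise with per-sample variance $\sigma_v^2$ (a symbol introduced here for illustration) and the Gibbs smoothness prior of Section A.2. Taking the negative logarithm of (A.6) then gives a quadratic cost:

```latex
\begin{align*}
P(\mathbf{y} \mid \mathbf{s}) &\propto
  \exp\left\{-\frac{1}{2\sigma_v^2}\,\|\mathbf{y} - \mathbf{H}\mathbf{s}\|_2^2\right\},
\qquad
P(\mathbf{s}) \propto \exp\left\{-\lambda\,\|\mathbf{L}\mathbf{s}\|_2^2\right\} \\
\hat{\mathbf{s}} &= \arg\max_{\mathbf{s}} \; P(\mathbf{y} \mid \mathbf{s})\,P(\mathbf{s})
  = \arg\min_{\mathbf{s}} \left\{ \frac{1}{2\sigma_v^2}\,\|\mathbf{y} - \mathbf{H}\mathbf{s}\|_2^2
  + \lambda\,\|\mathbf{L}\mathbf{s}\|_2^2 \right\} \\
\left(\mathbf{H}^T\mathbf{H} + 2\sigma_v^2\lambda\,\mathbf{L}^T\mathbf{L}\right)\hat{\mathbf{s}}
  &= \mathbf{H}^T\mathbf{y}
\end{align*}
```

Under these assumptions the MAP estimate has the same structure as the regularized least-squares solution, with the product $2\sigma_v^2\lambda$ controlling the trade-off between the data term and the smoothness prior.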


APPENDIX B


Markov and Gibbs 
Random Fields 





Markov random fields (MRFs) specified in terms of Gibbs distributions have become 
popular as a priori models in Bayesian formulations for image-processing applications
such as texture modeling and generation [Cro 83, Che 93], image segmentation
and restoration [Gem 84, Der 87, Pap 92], and motion estimation [Dub 93].
This appendix provides the definitions of an MRF and the Gibbs distribution and 
then describes their relationship by using the Hammersley—Clifford theorem. The 
specification of MRFs in terms of Gibbs distributions has led to the term “Gibbs 
random field” (GRF). We also discuss how to obtain the local (Markov) conditional 
pdfs from the Gibbs distribution, which is a joint pdf. 


B.1 Equivalence of Markov Random Fields 
and Gibbs Random Fields 


We start with defining a random field. A scalar random field $z = \{z(x), x \in A\}$ is a
stochastic process defined over a lattice A. Let denote a realization of the random 
field z. Recall that the random field z(x) evaluated at a fixed location x is a ran- 
dom variable. It follows that a scalar random field is a collection of scalar random 
variables, where a random variable is associated with each site of the lattice A. A
vector random field, such as velocity or displacement fields, is likewise a collection 
of random vectors. In the following, we limit our discussion to scalar random fields. 
A random field z can be discrete-valued or real-valued. For a discrete-valued random 
field, z(x) assumes a set of discrete values, i.e., z(x) ET = {0,1,...,2 — 1}, and for a 
real-valued random field z(x) E€ R, where R denotes the set of real numbers. 

The first step in defining MRFs and Gibbs distributions is to develop a neighborhood
system on $\Lambda$. Let $N_x$ denote a neighborhood of a site $x \in \Lambda$ with the
properties:


1. $x \notin N_x$, and
2. $x_j \in N_{x_i} \Leftrightarrow x_i \in N_{x_j}$ for all $x_i, x_j \in \Lambda$.


In words, a site x does not belong to its own set of neighbors, and if $x_j$ is a neighbor
of $x_i$, then $x_i$ must be a neighbor of $x_j$. A neighborhood system N over $\Lambda$ is defined
as $N = \{N_x, x \in \Lambda\}$, the collection of neighborhoods of all sites. Two examples of
the neighborhood $N_x$ of a site are depicted in Figure B.1.


B.1.1 Markov Random Fields 


MRFs are extensions of 1D causal Markov chains to 2D, and have been found useful
in image modeling and processing. MRFs have been traditionally specified in terms
of local conditional probability density functions (pdfs), which limits their utility.


Definition. The random field $z = \{z(x), x \in \Lambda\}$ is called an MRF with
respect to N if


$$p(z) > 0 \quad \text{for all } z$$


and

$$p\big(z(x_i) \mid z(x_j), \text{ all } x_j \neq x_i\big) = p\big(z(x_i) \mid z(x_j), \ x_j \in N_{x_i}\big)$$


where the pdf of a discrete-valued random variable/field is defined in terms
of a Dirac delta function.


The first condition states that all possible realizations should have non-zero probability,
while the second requires that the local conditional pdf at a site $x_i$ depends
only on the values of the random field within the neighborhood $N_{x_i}$ of that site.




Figure B.1 Examples of neighborhoods: (a) 4-pixel neighborhood and associated cliques
(singleton, horizontal, and vertical) and (b) 8-pixel neighborhood and associated cliques.


The specification of an MRF in terms of local conditional pdfs is cumbersome
because


• the conditional pdfs must satisfy some consistency conditions [Gem 84], which
cannot be easily verified,

• computation of the joint pdf p(z) from the local conditional pdfs is not straightforward,
and

• the relationship between the local spatial characteristics of a realization and the
form of the local conditional pdf is not obvious.


Fortunately, every MRF can be described by a Gibbs distribution (hence the
name Gibbs random field, or GRF), which overcomes these problems.

In order to define a Gibbs distribution, we need to define a clique. A clique C,
defined over the lattice $\Lambda$ with respect to the neighborhood system N, is a subset
of $\Lambda$ ($C \subseteq \Lambda$) such that either C consists of a single site or all pairs of sites in C are
neighbors. The sets of cliques for the 4-pixel and 8-pixel neighborhoods are shown in
Figure B.1. Notice that the number of different cliques grows quickly as the number
of sites in the neighborhood increases. The set of all cliques is denoted by $\mathcal{C}$.


B.1.2 Gibbs Random Fields 


The Gibbs distribution, with a neighborhood system N and the associated set of
cliques $\mathcal{C}$, is defined for discrete-valued random fields as


$$p(z) = \frac{1}{Q} \sum_{\omega} e^{-U(z=\omega)/T} \, \delta(z - \omega) \qquad \text{(B.1)}$$


where $\delta(\cdot)$ denotes a Dirac delta function, and the normalizing constant Q, called the
partition function, is given by


$$Q = \sum_{\omega} e^{-U(z=\omega)/T}$$


and, for continuous-valued random fields, as


$$p(z) = \frac{1}{Q} e^{-U(z)/T} \qquad \text{(B.2)}$$


where the normalizing constant Q is given by


$$Q = \int e^{-U(z)/T} \, dz$$


and U(z), the Gibbs potential (Gibbs energy), is defined as


$$U(z) = \sum_{C \in \mathcal{C}} V_C\big(z(x), x \in C\big) \qquad \text{(B.3)}$$


for both the discrete and continuous-valued random fields. Each $V_C(z(x), x \in C)$,
called the clique potential, depends only on the values z(x) for which $x \in C$. The
parameter T, known as the temperature, is used to control the peaking of the distribution.
Note that the Gibbs distribution is an exponential distribution that includes the
Gaussian as a special case.

The Gibbs distribution is a joint pdf of all random variables composing the random
field, as opposed to a local conditional pdf. It can be specified in terms of certain
desired structural properties of the field that are modeled through the clique potentials,
which is demonstrated in Section B.2.


B.1.3 Equivalence of MRF and GRF 


The equivalence of an MRF and a GRF is stated by the Hammersley—Clifford theo- 
rem, which provides a simple and practical way to specify MRFs through Gibbs 
potentials. 


Hammersley—Clifford (H—C) Theorem 


Let N be a neighborhood system. Then z(x) is an MRF with respect to N if and only
if p(z) is a Gibbs distribution with respect to N.

The H-C theorem was highlighted by Besag [Bes 74], based on Hammersley and
Clifford’s unpublished paper. Spitzer [Spi 71] also provided an alternate proof of the 
H-C theorem. The work of Geman and Geman [Gem 84] pioneered the popular 
use of Gibbs distributions to specify MRF models. Section B.2 demonstrates how to 
select clique potentials for continuous and discrete valued GRFs, and how to specify 
a Gibbs distribution by using clique potentials. 


B.2 Gibbs Distribution as an a priori pdf Model 


The Gibbs distribution has been used as a popular a priori pdf model to impose the 
spatial-smoothness constraint in motion estimation (Chapter 4) and image/video 
segmentation (Chapter 5). This section demonstrates how a spatial-smoothness con- 
straint can be formulated as an a priori pdf in the form of a Gibbs distribution. The 
clique potentials effectively express the local interaction between pixels and can be 
assigned arbitrarily, unlike the local pdfs in MRFs, which must satisfy certain con- 
sistency conditions. 


Example: Case of a Continuous-Valued GRF 


A real-valued motion-vector field can be modeled by a continuous-valued
GRF. Let us employ a four-point neighborhood system, depicted in Figure
B.1, with two-pixel cliques. For a continuous-valued GRF, a suitable potential
function for the two-pixel cliques may be


$$V_C\big(d(x_i), d(x_j)\big) = \|d(x_i) - d(x_j)\|^2 \qquad \text{(B.4)}$$


where $x_i$ and $x_j$ denote the elements of any two-pixel clique, and $\|\cdot\|$ is the
Euclidean distance. In Eqn. (B.3), $V_C(d(x_i), d(x_j))$ needs to be summed over
all two-pixel cliques. Clearly, a spatial configuration of motion vectors with
larger potential would have a smaller a priori probability.


Example: Case of a Discrete-Valued GRF 


If the motion vectors are quantized, say to 0.5-pixel accuracy, or we are
modeling a segmentation label field, then we have a discrete-valued GRF.
Suppose that a discrete-valued GRF z is defined over the 4 × 4 lattice shown
in Figure B.2(a), and that Figures B.2(b) and (c) show two realizations of a 4 × 4
binary segmentation label field z.

Figure B.2 Demonstration of a discrete-valued Gibbs model: (a) a generic 4 × 4 lattice, (b) a
realization of the label field z, (c) another realization of the label field z.


Let the two-pixel clique potential be defined as


$$V_C\big(z(x_i), z(x_j)\big) = \begin{cases} -\beta & \text{if } z(x_i) = z(x_j) \\ \beta & \text{otherwise} \end{cases} \qquad \text{(B.5)}$$


where $\beta$ is a positive number.

There are a total of 24 two-pixel cliques in a 4 × 4 image (shown by
double arrows). It can be easily seen, by summing all clique potentials, that
the configurations shown in Figures B.2(b) and (c) have the Gibbs potentials
$-24\beta$ and $+24\beta$, respectively. Clearly, when these Gibbs potentials
are substituted in (B.3) and then (B.1), the spatially smooth configuration
depicted in Figure B.2(b) has a higher a priori probability.
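
As a quick numerical check of this example, the following sketch sums the potential (B.5) over all horizontal and vertical two-pixel cliques of a 4 × 4 label field. The function name `gibbs_energy` and the two test fields are illustrative assumptions: Figure B.2 is not reproduced here, so a constant field stands in for the smooth realization and a checkerboard for a realization in which every neighboring pair differs.

```python
def gibbs_energy(labels, beta):
    """Sum the two-pixel clique potentials of (B.5) over a 4-neighborhood lattice."""
    rows, cols = len(labels), len(labels[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            # Visit the right and lower neighbor so each pair is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    energy += -beta if labels[r][c] == labels[rr][cc] else beta
    return energy

beta = 1.0
smooth = [[0] * 4 for _ in range(4)]                            # all labels equal
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]   # all neighbor pairs differ

print(gibbs_energy(smooth, beta))   # -24.0
print(gibbs_energy(checker, beta))  # 24.0
```

Since the a priori probability is proportional to $e^{-U(z)/T}$, the field with energy $-24\beta$ is favored by a factor of $e^{48\beta/T}$ over the one with energy $+24\beta$.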


Once a Gibbs distribution (which is a joint pdf) is specified, the local conditional 
pdf induced by this Gibbs distribution can be easily computed as shown below. 


B.3 Computation of Local Conditional Probabilities 
from a Gibbs Distribution 


In certain applications, such as in the Gibbs sampler method of optimization (see 
Appendix C), it may be desirable to obtain the local conditional pdfs from the joint 
pdf given by a Gibbs distribution. Derivation of the local conditional pdfs of a 
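discrete-valued GRF is shown in the following. Starting with the Bayes rule,


$$p\big(z(x_i) \mid z(x_j), \forall x_j \neq x_i\big) = \frac{p(z)}{p\big(z(x_j), \forall x_j \neq x_i\big)} = \frac{p(z)}{\sum_{z(x_i) \in \Gamma} p(z)} \qquad \text{(B.6)}$$


where the second equality follows from the total probability rule. In particular, let $A_\gamma$
denote the event $z(x_i) = \gamma$, for all $\gamma \in \Gamma$, and B stand for a fixed realization of the
remaining sites $x_j \neq x_i$; then (B.6) is simply a restatement of


$$P(A_\gamma \mid B) = \frac{P(A_\gamma \cap B)}{P(B)} = \frac{P(A_\gamma \cap B)}{\sum_{\gamma \in \Gamma} P(B \mid A_\gamma) P(A_\gamma)}$$


where P(·) denotes probability (obtained by integrating the pdf p(·) over a range).
Substituting the Gibbs distribution for p(·) in (B.6), the local conditional pdf
can be expressed in terms of the clique potentials (after some algebra) as


$$p\big(z(x_i) \mid z(x_j), \forall x_j \neq x_i\big) = Q_{x_i}^{-1} \exp\Big\{-\frac{1}{T} \sum_{C \mid x_i \in C} V_C\big(z(x), x \in C\big)\Big\} \qquad \text{(B.7)}$$


where


$$Q_{x_i} = \sum_{z(x_i) \in \Gamma} \exp\Big\{-\frac{1}{T} \sum_{C \mid x_i \in C} V_C\big(z(x), x \in C\big)\Big\}$$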
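
To make (B.7) concrete, here is a minimal sketch that evaluates the local conditional probabilities at one site of a discrete label field. It assumes the 4-pixel neighborhood and the two-pixel potential of (B.5); the function name `local_conditional` and the 3 × 3 test field are hypothetical choices made only for illustration.

```python
import math

def local_conditional(labels, site, gamma, beta, T):
    """Evaluate (B.7) at one site, using the two-pixel potential of (B.5)."""
    r, c = site
    rows, cols = len(labels), len(labels[0])
    neighbors = [(r + dr, c + dc)
                 for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= r + dr < rows and 0 <= c + dc < cols]
    weights = []
    for value in gamma:
        # Only cliques that contain the site matter; all others cancel in the ratio.
        u = sum(-beta if value == labels[rr][cc] else beta for rr, cc in neighbors)
        weights.append(math.exp(-u / T))
    q = sum(weights)  # local normalizing constant Q_{x_i}
    return [w / q for w in weights]

labels = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]
probs = local_conditional(labels, (1, 1), gamma=(0, 1), beta=1.0, T=1.0)
print(probs)  # label 0 is far more probable than label 1 at the center site
```

Note that the global partition function Q never has to be computed: the sum over the few values in $\Gamma$ at a single site is enough, which is what makes site-by-site sampling practical.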


For a more detailed treatment of MRFs and GRFs the reader is referred to [Gem 
84]. Rigorous treatment of the statistical formulations can also be found in [Bes 74]
and [Spi 71].


APPENDIX C


Optimization Methods 





Several motion-estimation and segmentation problem formulations require minimization
of a non-convex criterion function E(u), where u is some N-dimensional
unknown vector. Then, the motion-estimation/segmentation problem can be posed
so as to find


$$\hat{u} = \arg\min_{u} E(u)$$


This minimization is exceedingly difficult due to the large dimensionality of the
unknown vector and the presence of local minima. With non-convex functions, gradient-descent
methods (reviewed in Section C.1) generally cannot reach the global
minimum, because they get trapped in the nearest local minimum. In Section C.2,
we present two simulated (stochastic) annealing algorithms, the Metropolis algorithm
[Met 53] and the Gibbs sampler [Gem 84], which are capable of finding the
global minimum at the expense of a significant increase in computation time. Next,
in Section C.3, we present three greedy methods that are deterministic approximations
to simulated annealing and converge faster: the iterated conditional modes (ICM) [Bes 74],
mean-field annealing [Bil 91a], and highest-confidence-first (HCF) [Cho 90] algorithms.
For a detailed survey of popular annealing procedures the reader is referred to [Kir 83, Laa 87].


C.1 Gradient-Based Optimization


A function $f(u_1, \ldots, u_n)$ of several unknowns can be minimized by calculating its
partials with respect to each unknown, setting them equal to zero, and solving the
resulting equations


$$\frac{\partial f(u)}{\partial u_1} = 0, \quad \ldots, \quad \frac{\partial f(u)}{\partial u_n} = 0 \qquad \text{(C.1)}$$


simultaneously for $u_1, \ldots, u_n$. This set of simultaneous equations can be expressed as
a vector equation,


$$\nabla_u f(u) = 0 \qquad \text{(C.2)}$$


where $\nabla_u$ is the gradient operator with respect to the unknown vector u. Because it
is difficult to define a closed-form criterion function f(u) for motion estimation or
image segmentation, and hence to solve the set of equations (C.2) in closed form, we resort to
iterative (numerical) methods.


C.1.1  Steepest-Descent Method 


Steepest descent is probably the simplest numerical optimization method. It updates
the present estimate of the location of the minimum in the direction of the negative
gradient, called the steepest-descent direction. Since the gradient vector points in
the direction of maximum increase, the direction of steepest descent is just the opposite
direction, which is illustrated in Figure C.1.

In order to get closer to the minimum, we update our current estimate as


$$u^{(k+1)} = u^{(k)} - \alpha \, \nabla_u f(u) \big|_{u^{(k)}}$$


where $\alpha$ is some positive scalar, known as the step-size. The step-size is critical
for the convergence of the iterations, because if $\alpha$ is too small, we move by a very
small amount each time, and the iterations will take too long to converge. On the
other hand, if it is too large the algorithm may become unstable and oscillate about
the minimum. In the method of steepest descent, the step-size is usually chosen
heuristically.


Figure C.1 Illustration of local minima and the gradient-descent method.
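
The steepest-descent update can be sketched in a few lines. The quadratic test function, its gradient `grad_f`, the step-size 0.1, and the iteration count are all hypothetical choices for illustration, not values from the text.

```python
def steepest_descent(grad, u0, alpha, num_iters):
    """Iterate u <- u - alpha * grad(u) for a fixed number of steps."""
    u = list(u0)
    for _ in range(num_iters):
        g = grad(u)
        u = [ui - alpha * gi for ui, gi in zip(u, g)]
    return u

# Hypothetical convex criterion f(u) = (u1 - 1)^2 + 2 (u2 + 3)^2.
def grad_f(u):
    return [2.0 * (u[0] - 1.0), 4.0 * (u[1] + 3.0)]

u_min = steepest_descent(grad_f, u0=[0.0, 0.0], alpha=0.1, num_iters=200)
print(u_min)  # close to [1.0, -3.0]
```

For this particular function the error in the second component is multiplied by $(1 - 4\alpha)$ at every step, so any $\alpha > 0.5$ makes the iteration oscillate with growing amplitude, which illustrates why the step-size is critical.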


C.1.2 Newton-Raphson Method 


The optimum value for the step-size $\alpha$ can be estimated using the well-known
Newton-Raphson method for root finding. Here, the derivation for a function of
a single variable is shown for simplicity. In one unknown, we would like to find a
root of $f'(u)$. To this effect, we expand $f'(u)$ in a Taylor series about the point $u^{(k)}$ as


$$f'(u^{(k+1)}) = f'(u^{(k)}) + (u^{(k+1)} - u^{(k)}) f''(u^{(k)})$$


Since we wish $u^{(k+1)}$ to be a zero of $f'(u)$, we set


$$f'(u^{(k)}) + (u^{(k+1)} - u^{(k)}) f''(u^{(k)}) = 0 \qquad \text{(C.4)}$$


Solving (C.4) for $u^{(k+1)}$, we have


$$u^{(k+1)} = u^{(k)} - \frac{f'(u^{(k)})}{f''(u^{(k)})}$$


This result can be generalized for the case of a function of several unknowns as


$$u^{(k+1)} = u^{(k)} - H^{-1} \nabla_u f(u) \big|_{u^{(k)}} \qquad \text{(C.5)}$$


where H is the Hessian matrix


$$H_{ij} = \frac{\partial^2 f(u)}{\partial u_i \, \partial u_j}$$


The Newton—Raphson method finds an analytical expression for the step-size 
parameter in terms of the second-order partials of the criterion function. When a 
closed-form criterion function is not available, the Hessian matrix can be estimated 
by using numerical methods [Fle 87]. 
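
A minimal sketch of the one-dimensional Newton-Raphson iteration follows. The quartic test function and the starting point 1.5 are hypothetical; the derivatives are supplied in closed form, which is the situation in which the method is easiest to apply.

```python
def newton_raphson(df, d2f, u0, num_iters=20):
    """Find a zero of f'(u) by iterating u <- u - f'(u) / f''(u)."""
    u = u0
    for _ in range(num_iters):
        u = u - df(u) / d2f(u)
    return u

# Hypothetical criterion f(u) = u^4 - 3 u^2 + u.
def df(u):
    return 4.0 * u**3 - 6.0 * u + 1.0   # f'(u)

def d2f(u):
    return 12.0 * u**2 - 6.0            # f''(u)

u_star = newton_raphson(df, d2f, u0=1.5)
print(u_star, df(u_star))  # a stationary point near u = 1.13
```

The iteration only finds a stationary point near the starting value; here $f''(u) > 0$ at the result, so it is a minimum, but a different starting point could just as well lead to another minimum or to a maximum of this non-convex function.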

The gradient-descent methods suffer from a serious drawback: the solution
depends on the initial point. If we start in a “valley,” the algorithm will get stuck at the bottom
of that valley, which may be a “local” minimum, as depicted in Figure C.1. Because
the gradient vector is zero or near zero at or around a local minimum, the updates
become too small to move out of a local minimum. One solution to this problem is to
initialize the algorithm at several different starting points, and then choose the solution
that gives the smallest value of the criterion function. More sophisticated methods to reach
the global minimum regardless of the starting point, such as simulated annealing (SA),
are discussed next. However, SA methods require significantly more processing time.


C.2 Simulated Annealing 


Simulated annealing (SA) refers to a class of stochastic relaxation algorithms known 
as Monte Carlo methods. They are essentially prescriptions for a partially random 
search of the solution space. At each step of the algorithm, the previous solution is 
subjected to a random perturbation. Unlike deterministic gradient-based iterative 
algorithms, which always move in the direction of decreasing criterion function, 
simulated annealing permits, on a random basis, changes that increase the criterion 
function. This is because an uphill move is sometimes necessary in order to prevent 
the solution from settling in a local minimum. The probability of accepting uphill 
moves is controlled by a temperature parameter. The simulated annealing process 
starts by first “melting” the system at a high enough temperature that almost all 
random moves are accepted. Then the temperature is lowered slowly according to 
a “cooling” regime. At each temperature, the simulation must proceed long enough 
for the system to reach a “steady state.” The sequence of temperatures and the num- 
ber of perturbations at each temperature constitute the “annealing schedule.” The 
convergence of the procedure is strongly related to the annealing schedule. In their 
pioneering work, Geman and Geman [Gem 84] proposed the temperature schedule


$$T = \frac{\tau}{\ln(i+1)}, \quad i = 1, 2, \ldots \qquad \text{(C.6)}$$


where $\tau$ is a constant and i is the iteration cycle. This schedule is overly conservative
but guarantees reaching the global minimum. Schedules that lower the temperature
at a faster rate have also been shown to work (without a proof of convergence).
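
To see how conservative the logarithmic schedule (C.6) is, the sketch below compares it with a geometric schedule. The geometric form and its rate 0.95 are assumptions chosen as a typical example of a faster schedule; they are not taken from the text.

```python
import math

def log_schedule(tau, i):
    """Temperature at iteration cycle i >= 1 under the logarithmic schedule (C.6)."""
    return tau / math.log(i + 1)

def geometric_schedule(t0, rate, i):
    """A faster alternative (assumption): no proof of convergence applies to it."""
    return t0 * rate ** i

for i in (1, 10, 100, 1000):
    print(i, log_schedule(1.0, i), geometric_schedule(1.0, 0.95, i))
```

After 1000 cycles the logarithmic schedule has lowered the temperature only by about a factor of ten from its starting value, while the geometric schedule is effectively at zero.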

The process of generating random perturbations is referred to as sampling the 
solution space. In the following, we present two algorithms that differ in the way 
they sample the solution space. 


C.2.1 Metropolis Algorithm 


In Metropolis sampling, at each step of the algorithm a new candidate solution
is generated at random. If this new solution decreases the criterion function, it is
always accepted; otherwise, it is accepted according to an exponential probability
distribution. The probability P of accepting the new solution is then given by


$$P = \begin{cases} e^{-\Delta E / T} & \text{if } \Delta E > 0 \\ 1 & \text{if } \Delta E \le 0 \end{cases}$$


where $\Delta E$ is the change in the criterion function due to the perturbation, and T
is the temperature parameter. If T is relatively large, the probability of accepting a
positive energy change is higher than when T is small for a given $\Delta E$. We provide a
summary of the Metropolis algorithm in the following [Met 53]:


1. Set i = 0 and $T = T_{\max}$. Choose an initial $u^{(0)}$ at random.
2. Generate a new candidate solution $u^{(i+1)}$ at random.
3. Compute $\Delta E = E(u^{(i+1)}) - E(u^{(i)})$.
4. Compute P from

$$P = \begin{cases} e^{-\Delta E / T} & \text{if } \Delta E > 0 \\ 1 & \text{if } \Delta E \le 0 \end{cases}$$

5. If P = 1, accept the perturbation; otherwise, draw a random number that is
uniformly distributed between 0 and 1. If the number drawn is less than P,
accept the perturbation.
6. Set i = i + 1. If $i < I_{\max}$, where $I_{\max}$ is predetermined, go to 2.
7. Set i = 0, and $u^{(0)} = u^{(I_{\max})}$. Reduce T according to a temperature schedule. If
$T > T_{\min}$, go to 2; otherwise, terminate.


Because the candidate solutions are generated by random perturbations, the 
algorithm typically requires a large number of iterations for convergence. Thus, the 
computational load of simulated annealing is significant, especially when the set of 
allowable values $\Gamma$ (defined in Appendix B for u discrete) contains a large number of
values or u is a continuous variable. Also, the computational load increases with the 
number of components in the unknown vector. 
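
The Metropolis steps can be sketched as follows for a scalar unknown. Several details are assumptions made for illustration: the geometric cooling factor, the uniform perturbation, the hypothetical criterion `energy`, and the bookkeeping of the best solution seen so far, which is a convenience and not part of the listed steps.

```python
import math
import random

def metropolis_anneal(energy, perturb, u0, t_max, t_min, cooling, iters_per_temp, rng):
    """Simulated annealing with the Metropolis acceptance rule."""
    u, e = u0, energy(u0)
    best_u, best_e = u, e                # bookkeeping only (assumption)
    t = t_max
    while t > t_min:
        for _ in range(iters_per_temp):
            candidate = perturb(u, rng)  # step 2: random candidate
            cand_e = energy(candidate)
            delta = cand_e - e           # step 3: change in the criterion
            # Steps 4-5: downhill moves always pass; uphill moves pass
            # with probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                u, e = candidate, cand_e
                if e < best_e:
                    best_u, best_e = u, e
        t *= cooling                     # step 7 with a geometric schedule (assumption)
    return best_u, best_e

def energy(u):
    # Hypothetical criterion: local minimum near u = 1.9, lower minimum near u = -2.1.
    return u**4 - 8.0 * u**2 + 3.0 * u

def perturb(u, rng):
    return u + rng.uniform(-0.5, 0.5)

best_u, best_e = metropolis_anneal(energy, perturb, u0=2.0, t_max=10.0, t_min=1e-3,
                                   cooling=0.9, iters_per_temp=200, rng=random.Random(0))
print(best_u, best_e)
```

The search starts inside the shallower valley; at high temperature uphill moves are accepted often enough to cross the barrier, which a gradient-descent iteration started at the same point could not do.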


C.2.2 Gibbs Sampler 


Let’s assume that u is a random vector composed of lexicographic ordering of the 
elements of a scalar GRF u(x). In Gibbs sampling, the perturbations are generated 
according to local conditional probability density functions (pdfs) derived from the 
given Gibbsian distribution, according to (B.7) in Appendix B, rather than making 
totally random perturbations and then deciding whether or not to accept them. The 
Gibbs sampler method can be summarized as: 


1. Set $T = T_{\max}$. Choose an initial u at random.
2. Visit each site x to perturb the value of u at that site as follows:


a. At site x, compute the conditional probability of u(x) to take each of the
allowed values from the set $\Gamma$, given the clique potentials and present values
of its neighbors using (B.7).

b. Once the probabilities for all elements of the set are computed, draw the
new value of u(x) from this distribution. We clarify the meaning of “draw”
by using an example.


Example. Suppose that $\Gamma = \{0, 1, 2, 3\}$, and it was found that

$$P\big(u(x_i) = 0 \mid u(x_j), x_j \in N_{x_i}\big) = 0.2$$
$$P\big(u(x_i) = 1 \mid u(x_j), x_j \in N_{x_i}\big) = 0.1$$
$$P\big(u(x_i) = 2 \mid u(x_j), x_j \in N_{x_i}\big) = 0.4$$
$$P\big(u(x_i) = 3 \mid u(x_j), x_j \in N_{x_i}\big) = 0.3$$

Then a random number R, uniformly distributed between 0 and 1, is generated,
and the value of $u(x_i)$ is decided as follows:


if $0 \le R < 0.2$ then $u(x_i) = 0$
if $0.2 \le R < 0.3$ then $u(x_i) = 1$
if $0.3 \le R < 0.7$ then $u(x_i) = 2$
if $0.7 \le R \le 1$ then $u(x_i) = 3$


3. Repeat step 2 sufficiently many times at a given temperature, then lower the 
temperature, and go to 2. Note that the conditional probabilities depend on the 
temperature parameter. 
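
The “draw” in step 2b amounts to walking the cumulative distribution until it exceeds the uniform random number R. A minimal sketch, using the probabilities from the example (the function name `draw` is an illustrative choice):

```python
import random

def draw(values, probs, r):
    """Map a uniform number r in [0, 1) to a value via the cumulative distribution."""
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if r < cumulative:
            return value
    return values[-1]  # guards against round-off when r is very close to 1

values = [0, 1, 2, 3]
probs = [0.2, 0.1, 0.4, 0.3]

print(draw(values, probs, 0.25))  # 1
print(draw(values, probs, 0.65))  # 2

# In the sampler itself, r would come from a random number generator:
rng = random.Random(0)
sample = draw(values, probs, rng.random())
```

In a full Gibbs sampler, `probs` would be recomputed at every site from (B.7) using the current values of the neighbors and the current temperature.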


Perturbations through Gibbs sampling lead to very interesting properties, which 
have been shown by Geman and Geman [Gem 84]: 


• For any initial estimate, Gibbs sampling will yield a distribution that is asymptotically
Gibbsian, with the same properties as the Gibbs distribution used to
generate it. This result can be used to simulate a Gibbs random field.

• For the particular temperature schedule (C.6), the global optimum will be
reached. However, in practice, convergence with this schedule may be too slow.


C.3 Greedy Methods 


Greedy optimization methods look for simple solutions to complex problems by a 
step-by-step procedure where the solution that provides the most benefit is chosen 
at each step. They can be considered as deterministic approximations to simulated 
annealing. We discuss three such methods: iterated conditional modes, mean-field 
annealing, and highest confidence first. 


C.3.1 Iterated Conditional Modes 


The iterated conditional modes (ICM) algorithm, also known as the greedy algorithm, is
a deterministic procedure that aims to reduce the computational load of the stochastic
annealing methods. It can be posed as a special case of both the Metropolis and
Gibbs sampler algorithms. ICM can best be conceptualized as the “instant freezing”
case of the Metropolis algorithm, i.e., when the temperature T is set equal to zero for
all iterations. Then the probability of accepting perturbations that increase the value
of the cost function is always 0 (refer to step 4 of the Metropolis algorithm). Alternatively,
it has been shown that ICM converges to the solution that maximizes the local
conditional probabilities given by (B.7) at each site. Hence, it can be implemented
as in Gibbs sampling, but by choosing the value at each site that gives the maximum
local conditional probability rather than drawing a value based on the conditional
probability distribution.

ICM converges much faster than the SA algorithms. However, because ICM only 
allows those perturbations yielding negative $\Delta E$, it is likely to get trapped in a local
minimum, much like gradient-descent algorithms. Thus, it is critical to initialize 
ICM with a reasonably good initial estimate. The use of ICM has been reported for 
image restoration [Bes 74] and image segmentation [Pap 92]. 
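
A minimal ICM sketch for a discrete label field is given below. It is a simplification made for illustration: only the a priori term with the two-pixel potential of (B.5) and a 4-pixel neighborhood is used, whereas a practical restoration or segmentation criterion would also include a data term. The function name `icm` and the test field are hypothetical.

```python
def icm(labels, gamma, beta, max_sweeps=10):
    """Replace each label by the value of lowest local energy until nothing changes."""
    labels = [row[:] for row in labels]   # work on a copy
    rows, cols = len(labels), len(labels[0])
    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                nbrs = [labels[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < rows and 0 <= cc < cols]

                def local_energy(v):
                    # Sum of the two-pixel clique potentials that contain this site.
                    return sum(-beta if v == n else beta for n in nbrs)

                # Lowest local energy corresponds to the highest local
                # conditional probability in (B.7).
                best = min(gamma, key=local_energy)
                if local_energy(best) < local_energy(labels[r][c]):
                    labels[r][c] = best
                    changed = True
        if not changed:
            break
    return labels

noisy = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
print(icm(noisy, gamma=(0, 1), beta=1.0))  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

Because a label is changed only when the local energy strictly decreases, the total energy never increases, which is exactly why the procedure stops at the nearest local minimum.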


C.3.2 Mean-Field Annealing 


Mean-field annealing is based on the “mean-field approximation” (MFA) idea in sta- 
tistical mechanics. MFA allows replacing each random variable (random field evalu- 
ated at a particular site) by the mean of its marginal probability distribution at a 
given temperature. Mean-field annealing is then concerned with the estimation of
these means at each site. Because the estimation of each mean is dependent on using 
the neighboring sites, this estimation is performed using an annealing schedule. The 
algorithm for annealing the mean field is similar to SA except that stochastic relax- 
ation at each temperature is replaced by a deterministic relaxation to minimize the 
so-called mean field error, usually using a gradient-descent algorithm. 

Historically, MFA was limited to Ising-type models described by a criterion func- 
tion involving a binary vector. It was later extended to a wider class of problems, 
including those with continuous variables [Bil 91b]. Experiments suggest that the 
MFA is valid for MRFs with local interactions over small regions. Thus, computa- 
tions of the means and the mean-field error are often based on Gibbsian distri- 
butions. It has been claimed that mean-field annealing converges to an acceptable 
solution approximately 50 times faster than SA. The implementation of MFA is not 
unique. Covering different implementations of mean-field annealing [Orl 85, Bil 92,
Abd 92, Zha 93] is beyond the scope of this book.


C.3.3 Highest Confidence First


The highest-confidence-first (HCF) algorithm proposed by Chou and Brown [Cho 90]
is a deterministic, non-iterative algorithm. It is guaranteed to reach a local minimum
of the potential function after a finite number of steps.

In the case of a discrete-valued GRF, the minimization is performed on a site-by-site
basis according to the following rules: i) Sites with reliable data can be labeled
without using the a priori probability model. ii) Sites where the data is unreliable
should rely on neighborhood interaction for label assignment. iii) Sites with unreliable
data should not affect sites with reliable data through neighborhood interaction.

Guided by these principles, a scheme that determines a particular order for 
assigning labels and systematically increases neighborhood interaction is designed. 
Initially, all sites are labeled “uncommitted.” Once a label is assigned to an uncom- 
mitted site, the site is committed and cannot return to the uncommitted state. How- 
ever, the label of a committed site can be changed through another assignment. A 
“stability” measure is calculated for each site based on the local conditional a poste- 
riori probability of the labels at that site, to determine the order in which the sites are 
to be visited. The procedure terminates when the criterion function can no longer be 
decreased by reassignment of the labels. 

Among the deterministic methods, HCF is simpler and more robust than MFA, 
and more accurate than the ICM. Extensions of HCF for the case of continuous 
variables also exist.


APPENDIX D


Model Fitting 





Several motion-estimation, 3D-structure estimation, and image/motion-segmentation 
problems require fitting a model to available data samples. This appendix presents 
model-fitting solutions that are common to these problems. Linear least-squares 
or total least-squares methods are used when available data samples are free from 
outliers. Random-sample consensus (RANSAC) is an effective method to deal with 
data with outliers. 


D.1 Least-Squares Fitting 


In order to keep the presentation simple, let’s suppose that the available data consists
of pairs of points $(x_1, y_1), \ldots, (x_N, y_N)$ and we want to fit a line y = m x + b to these
points such that the sum of squared vertical distances between the given points and the
line, $\delta_i = (m x_i + b) - y_i$, depicted in Figure D.1(a), is minimized. Hence, the problem can be
stated as: given $(x_1, y_1), \ldots, (x_N, y_N)$, find (m, b) to minimize


$$E_{LS} = \sum_{i=1}^{N} (y_i - m x_i - b)^2 \qquad \text{(D.1)}$$


which can be expressed in vector-matrix form as the least-squares solution of a set of
linear equations of the form A h = y to minimize


$$E_{LS} = \left\| \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} \right\|^2 = \|y - A h\|^2$$


where y denotes the vector of observations, h is the vector of unknowns, and A is the
matrix of coefficients as defined in the equation. Since


$$E_{LS} = \|y - A h\|^2 = (y - A h)^T (y - A h) = y^T y - 2 (A h)^T y + (A h)^T (A h)$$


we have


$$\frac{\partial E_{LS}}{\partial h} = 2 A^T A h - 2 A^T y = 0 \qquad \text{(D.2)}$$


which gives


$$h_{LS} = (A^T A)^{-1} A^T y \qquad \text{(D.3)}$$
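
For the line-fitting case, $A^T A$ is only 2 × 2, so (D.3) can be evaluated directly without a linear-algebra library. The sketch below does this; the function name `fit_line_ls` and the sample points are illustrative assumptions.

```python
def fit_line_ls(points):
    """Least-squares line y = m x + b from the normal equations A^T A h = A^T y."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    # With rows [x_i, 1] in A: A^T A = [[sxx, sx], [sx, n]] and A^T y = [sxy, sy].
    det = n * sxx - sx * sx
    m = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return m, b

points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
print(fit_line_ls(points))  # (2.0, 1.0)
```

The determinant vanishes when all $x_i$ are equal, i.e., for a vertical line, which this parameterization cannot represent.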


D.2 Least-Squares Solution of Homogeneous 
Linear Equations 


Consider a set of M homogeneous linear equations in N unknowns, M > N, in the
form


$$A h = 0 \qquad \text{(D.4)}$$


where A is an M × N matrix of coefficients with rank N and h is an N × 1 vector
consisting of unknown model parameters. Assuming the coefficient matrix is noisy,
the over-determined system A h = 0 is inconsistent and does not have a nontrivial solution.
We can define a least-squares solution, given by


$$\hat{h} = \arg\min_{h} \|A h\|, \quad \text{subject to } \|h\| = 1$$


The constraint $\|h\| = 1$ avoids the trivial solution h = 0 and sets an arbitrary
scale factor. We can reexpress the constraint as


$$1 - h^T h = 0$$


and using the Lagrangian formulation, the constrained-optimization problem can be
expressed as


$$\hat{h} = \arg\min_{h} \{ h^T A^T A h + \lambda (1 - h^T h) \}$$


where the solution can be found by solving the equation


$$\frac{\partial}{\partial h} \{ h^T A^T A h + \lambda (1 - h^T h) \} = 0$$


which yields

$$A^T A h - \lambda h = 0$$


Hence, we know that h is an eigenvector of $A^T A$ and $\lambda$ is an eigenvalue, and the
least-squares error is given by


$$e = h^T A^T A h = h^T \lambda h = \lambda$$


Therefore, the error will be minimal for $\lambda = \min_i \lambda_i$, and the solution $\hat{h}$ is the
eigenvector of $A^T A$ corresponding to the smallest eigenvalue.


D.2.1 Alternate Derivation 


An alternate derivation can be based on the singular value decomposition (SVD) of
the matrix A, $A = U \Sigma V^T$, where U is an M × N orthonormal matrix, $\Sigma$ is an N × N
diagonal matrix, and V is an N × N orthonormal matrix. Since U and V are orthonormal,
we have


$$\|A h\| = \|U \Sigma V^T h\| = \|\Sigma V^T h\| \quad \text{and} \quad \|V^T h\| = \|h\|$$

If we let $y = V^T h$, then $\|A h\| = \|\Sigma y\|$, and we can pose the problem as

$$\hat{y} = \arg\min_{y} \|\Sigma y\|, \quad \text{subject to } \|y\| = 1$$


Since $\Sigma$ is diagonal and the singular values are sorted in descending order,
$\hat{y} = [0 \ 0 \ \ldots \ 0 \ 1]^T$ and hence the solution $\hat{h} = (V^T)^{-1} \hat{y} = V \hat{y}$ is given by the last column of
the matrix V.


D.3 Total Least-Squares Fitting 


The method of total least-squares fits a line to given data samples by modeling the
uncertainty in both variables as $x = x_i + \epsilon$ and $y = y_i + \delta$. This can be achieved by
minimizing the sum of squares of the shortest distances $d_i$, depicted in Figure D.1(b),
between the points and the line. The shortest distance between a point $(x_i, y_i)$ and
the line y = mx + b is given by


$$d_i = \delta_i \cos\theta = \frac{\delta_i}{\sqrt{1 + \tan^2\theta}} = \frac{\delta_i}{\sqrt{1 + m^2}}$$


where $\delta_i$ denotes the vertical distance between the point and the line, and the slope
of the line is $m = \tan\theta$. Hence, the total least-squares problem can be
formulated as: find (m, b) to minimize


$$E_{TLS} = \frac{1}{1 + m^2} \sum_{i=1}^{N} (y_i - m x_i - b)^2 \qquad \text{(D.4)}$$


A more convenient formulation of the TLS problem is possible if we parameterize
the equation of the line as px + qy = c, where $n = [p \ q]^T$ denotes the unit normal
vector such that $p^2 + q^2 = 1$. Then, the minimum distance is given by $|p x_i + q y_i - c|$,
and the problem can be formulated as: find (p, q, c) to minimize the sum of squared
distances


$$E_{TLS} = \sum_{i=1}^{N} (p x_i + q y_i - c)^2 \qquad \text{(D.5)}$$


We can first evaluate c in terms of p and q by differentiating the cost function
with respect to c


$$\frac{\partial E_{TLS}}{\partial c} = -2 \sum_{i=1}^{N} (p x_i + q y_i - c) = 0 \qquad \text{(D.6)}$$

to obtain

$$c = p \bar{x} + q \bar{y} \qquad \text{(D.7a)}$$

where

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \quad \text{and} \quad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i \qquad \text{(D.7b)}$$


Next, we substitute the value of c (D.7) in the cost function (D.5) to solve for
p and q


$$E_{TLS} = \sum_{i=1}^{N} \big( p (x_i - \bar{x}) + q (y_i - \bar{y}) \big)^2 = (A h)^T A h \qquad \text{(D.8a)}$$

where

$$A = \begin{bmatrix} x_1 - \bar{x} & y_1 - \bar{y} \\ \vdots & \vdots \\ x_N - \bar{x} & y_N - \bar{y} \end{bmatrix} \quad \text{and} \quad h = \begin{bmatrix} p \\ q \end{bmatrix} \qquad \text{(D.8b)}$$

Differentiating $E_{TLS}$ with respect to h and setting the result equal to zero,

$$\frac{d E_{TLS}}{d h} = 2 A^T A h = 0 \qquad \text{(D.9)}$$


The solution of $A^T A h = 0$, subject to $\|h\|^2 = 1$, is the eigenvector of $A^T A$ associated
with the smallest eigenvalue (i.e., the least-squares solution to the homogeneous
linear system Ah = 0).

An alternative derivation of the TLS solution based on the SVD is presented in
[Mar 07]. Note that the singular values of an M × N real matrix A are the square
roots of the eigenvalues of the N × N real, symmetric matrix $A^T A$, and given the SVD
$A = U \Sigma V^T$, the columns of V correspond to eigenvectors of $A^T A$.
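
For a line, $A^T A$ in (D.8) is a 2 × 2 symmetric matrix, so its smallest eigenvalue and the associated eigenvector are available in closed form. The following sketch uses that shortcut in place of a general eigen-solver or SVD; the function name `fit_line_tls` and the test points are illustrative assumptions.

```python
import math

def fit_line_tls(points):
    """Total least-squares line p x + q y = c with p^2 + q^2 = 1."""
    n = len(points)
    xm = sum(x for x, _ in points) / n
    ym = sum(y for _, y in points) / n
    sxx = sum((x - xm) ** 2 for x, _ in points)
    syy = sum((y - ym) ** 2 for _, y in points)
    sxy = sum((x - xm) * (y - ym) for x, y in points)
    # Smallest eigenvalue of A^T A = [[sxx, sxy], [sxy, syy]].
    lam = 0.5 * (sxx + syy) - math.hypot(0.5 * (sxx - syy), sxy)
    # Two algebraically equivalent eigenvector forms; keep the longer one
    # so that the normalization below is well conditioned.
    v1 = (sxy, lam - sxx)
    v2 = (lam - syy, sxy)
    p, q = max(v1, v2, key=lambda v: math.hypot(v[0], v[1]))
    norm = math.hypot(p, q)  # zero only if the scatter has no preferred direction
    p, q = p / norm, q / norm
    return p, q, p * xm + q * ym

p, q, c = fit_line_tls([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
print(-p / q, c / q)  # slope and intercept of the equivalent form y = m x + b

vp, vq, vc = fit_line_tls([(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)])
print(vp, vq, vc)     # a vertical line x = 1, which y = m x + b cannot express
```

The second call shows one practical advantage of the (p, q, c) parameterization: vertical lines need no special handling.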


D.4  Random-Sample Consensus (RANSAC) 


Random-sample consensus (RANSAC) is a general framework for model fitting in 
the presence of outliers. It can be summarized by the following steps: 


1. Choose a minimal subset of the original data, called the hypothetical inliers set, 
at random. 

2. Fit the model to the selected set of hypothetical inliers. 

3. Test how well the remaining data points fit the model. Those points that fit the 
estimated model well are recorded as members of the consensus set. 

4. Refine the model by re-estimating it using all members of the consensus set. 

5. The estimated model is deemed good if sufficiently many data points have been 
classified as part of the consensus set. Hence, we keep the refined model if 
its consensus set is larger than the previously saved model; otherwise, we go to 
step 1 and repeat the process. 
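
The steps above can be sketched for line fitting as follows. The inlier threshold, the number of trials, the helper `fit_line`, and the synthetic data are all hypothetical choices made for illustration. As a small shortcut, the refinement of step 4 is carried out only for models that pass the comparison of step 5, which yields the same saved model.

```python
import random

def fit_line(points):
    """Least-squares fit of y = m x + b via the 2x2 normal equations."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    det = n * sxx - sx * sx
    return (n * sxy - sx * sy) / det, (sxx * sy - sx * sxy) / det

def ransac_line(points, num_trials, threshold, rng):
    best_model, best_consensus = None, []
    for _ in range(num_trials):
        sample = rng.sample(points, 2)               # step 1: minimal subset
        if sample[0][0] == sample[1][0]:
            continue                                 # vertical pair: y = m x + b undefined
        m, b = fit_line(sample)                      # step 2: fit to the sample
        consensus = [(x, y) for x, y in points       # step 3: test all points
                     if abs(y - (m * x + b)) < threshold]
        if len(consensus) > len(best_consensus):     # step 5: keep if larger
            best_model = fit_line(consensus)         # step 4: refine on consensus set
            best_consensus = consensus
    return best_model, best_consensus

# Synthetic data: 20 points on y = 2 x + 1 plus 5 gross outliers.
inliers = [(float(x), 2.0 * x + 1.0) for x in range(20)]
outliers = [(2.0, 30.0), (5.0, -20.0), (9.0, 45.0), (14.0, -10.0), (17.0, 60.0)]
model, consensus = ransac_line(inliers + outliers, num_trials=100,
                               threshold=0.5, rng=random.Random(0))
print(model, len(consensus))
```

A plain least-squares fit to all 25 points would be pulled away from the true line by the outliers, whereas here the outliers never enter the final fit once a sample of two inliers has been drawn.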


This procedure is repeated N times, where N is a fixed number, each time producing
either a new refined model with a corresponding consensus set or a rejected model.
RANSAC produces a reasonable result with a certain probability, which increases as
the number of repetitions N increases [Fis 81]. When N is limited, RANSAC may
perform poorly if the number of inliers is less than 50% of the data, and the
solution may not be optimal even for moderately contaminated sets. RANSAC can only
estimate one model for a given data set. When the data set can be modeled by two
(or more) models, the Hough transform is a more robust technique for finding
correct models.
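The dependence of the success probability on the number of repetitions can be made concrete with a standard result that is not stated in the text above. If a fraction w of the data are inliers and each minimal sample contains s points, one random sample is outlier-free with probability w^s, so at least one of N independent samples is outlier-free with probability 1 - (1 - w^s)^N. The sketch below (the function name `ransac_trials` is hypothetical) solves this for N.

```python
import math

def ransac_trials(p_success, inlier_ratio, sample_size):
    """Smallest N such that 1 - (1 - w**s)**N >= p_success."""
    if not 0.0 < p_success < 1.0:
        raise ValueError("p_success must lie strictly between 0 and 1")
    p_good = inlier_ratio ** sample_size     # probability that one sample is all inliers
    if p_good <= 0.0:
        raise ValueError("no all-inlier sample is possible")
    if p_good >= 1.0:
        return 1                             # every sample is outlier-free
    return math.ceil(math.log(1.0 - p_success) / math.log(1.0 - p_good))

print(ransac_trials(0.99, 0.5, 2))   # 17: line fitting with 50% inliers
print(ransac_trials(0.99, 0.5, 4))   # 72: four-point samples with 50% inliers
```

The required N grows quickly as the inlier fraction falls or the minimal sample size grows, which is consistent with the remark that a limited N performs poorly on heavily contaminated data.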
Glossary (术语表)


Above-right predictor (ARP) 右上 预测 器 (用 
于 对 当前 块 的 右上 角 区 域 运动 矢量 的 获取 ) 
Accommodation distance 调视距离 (为使得视网膜上的成像清晰而调整的距离)
Advanced residual prediction (ARP) 高 级 残 差 
预测 


Advanced Television System Committee (ATSC) 
standard ATSC 标准 ( 是 美国 的 数字 电视 
国家 标准 ， 由 美国 的 高 级 电视 业务 顾问 委员 
会 (Advanced Television System Committee ) 
于 1995 年 9 月 正式 通过 ) 

AE (angular error) 角 误 差 

Affine model 仿 射 模型 

Alpha-trimmed mean filters 修正的 α 均值滤波器

AMVP (advanced-motion vector prediction) 高级运动矢量预测

Anaglyph 补 色 立体 图 

Anchor pictures 锚帧图像 (AP) (多视点视频编码中的参考帧，编码时不用于空间预测，而只用于视点间预测)

Anti-alias filtering 抗混叠滤波

Aperture problem 孔径问题

Apparent motion 表观运动 (在本书中指，通过观察视频感知到的对应场或光流场)

Artifacts 引入失真 (原意为人工产品；也可译为伪影，即图像处理过程中引入的各种失真或非理想情况)

ASO (arbitrary slice ordering) 任意条带顺序

Asymmetric property 不对称特性

AWA (adaptive-weighted-averaging) filter 自适应加权均值滤波器

Backward-motion estimation 后 向 运动 估计 

Bayer color-filter array pattern Bayer 彩色 滤波 
阵列 模式 ( 指 彩 色 系 统 中 的 RGB 三 原色 按 
照 GRGR/BGBG 排列 的 一 种 阵列 模式 ， 最 
早 由 Bayer 提出 ， 又 称 为 Bayer color-filter 
pattern ) 

Benchmarking 基准

Bi-lateral filters 双边滤波器

Binocular depth cues 双目深度线索

BLA (broken-link access) picture 断链访问图像

Blind-image restoration 图像盲复原

Blur 模糊 (由于运动、镜头散焦、噪声等引起的图像锐度降低)

CABAC (context-adaptive binary arithmetic coding) 上下文自适应二进制算术编码

Cartesian coordinates 笛卡尔坐标系

CAVLC (context-adaptive variable-length coding) 上下文自适应变长编码

CCMF (cross-correlated multi-frame) Wiener filter 互相关多帧维纳滤波器

CCN (content centric networking) 内容中心网络

CDF (cumulative distribution function) 累积分布函数

CEA (Consumer Electronics Association) 消费电子协会

Center of projection 投影中心

CFA (color filter array) interpolation 彩色滤波阵列内插

Chrominance 色度

Chunks (数据)块

CIE (International Commission on Illumination) 国际照明委员会

CLS (constrained least-squares) filtering 有约束最小二乘滤波器

Clustering 聚类

Coiflet filters Coiflet 滤波器 (Coiflet 小波系数滤波器)

Contouring artifacts 轮廓伪影

Covariance-based adaptive interpolation 基于协方差的自适应内插

CR (compression ratio) 压缩率

CR (conditional replenishment) 条件补偿

CRA (clean random-access) picture 清除随机访问图像

Critical velocity 临界速度

DBBP (depth-based block partitioning) 基于深度的块分割

DCP (disparity-compensated prediction) 视差补偿预测

Deblurring of images 图像去模糊

Decimation 抽取 (信号降采样)

Defocus blur 散焦模糊

Dehomogenization 非均匀化

De-mosaicking 去马赛克

Dense-correspondence estimation problem 稠密对应估计问题

Dense-motion (optical flow/displacement) estimation 稠密运动 (光流/位移) 估计

DFD (displaced-frame difference) 位移帧间差

Diamond search (DS) 钻石搜索

DIBR (depth-image based rendering) 基于深度图像的绘制

Differential methods 差分法

Diffusion-based-in-painting 基于扩散的图像修复

Digital dodging-and-burning 数字淡化和加深 (不同程度的数字化曝光)

Digital micromirror devices (DMDs) 数字微镜晶片

Digital Terrestrial Multimedia Broadcasting standards 数字多媒体地面广播标准

Direct Linear Transformation (DLT) 直接线性变换

Disocclusion regions 空洞区

Distortion 失真

DMS (discrete memoryless source) 离散无记忆源

DMVP (depth-based motion vector prediction) 基于深度的运动矢量预测

Down-conversion 下转换

Down-sampling (sub-sampling) 亚采样

DVI (Digital Visual Interface) standard 数字视频接口标准

DWT (discrete-wavelet transform) 离散小波变换

EM (expectation-maximization) algorithm 期望最大化算法

Embedded block coding with optimized truncation (EBCOT) 优化截断的嵌入块编码

Encrypted DCP files 加密 DCP 文件

Entropy coding 熵编码

Epipolar geometry 极几何 (双目立体匹配中两个透视相机间的一种特殊的几何关系)

ETSI (European Telecommunications Standards Institute) 欧洲电信标准协会

European Broadcasting Union (EBU) 欧洲广播联盟

Exemplar-based methods 基于样例的方法

EZW (embedded zerotree wavelet transform) 嵌入小波零树变换

Fidelity range extensions (FRExt) 保 真 度 范围 
扩展 

Field pictures 场图像

Four-fold symmetry 四元对称

FR (full reference metrics) 全参考度量

Free-view 2D video 自由视角 2D 视频

Gaze-shifting eye movements 注视转移眼动

Gibbs random field (GRF) Gibbs 随机场

GOP (group of pictures) 图像组

GT (ground-truth) data 标准数据 (指得到的原始真实数据)

GVR (gradual view refresh) 渐进刷新

HCF (highest confidence first) algorithm 最大可信度优先算法

HDMI (High-Definition Multimedia Interface) 高清晰度多媒体接口

HDS (HTTP Dynamic Streaming) HTTP 动态流

Hermitian symmetric Hermitian 对称

HEVC (high-efficiency video-coding) standard 高效视频编码标准 (ITU-T 和 MPEG 联合于 2013 年正式发布的视频压缩编码国际标准)

Hexagonal matching 六边形匹配

Histogram 直方图

Homography 单应性 (一一对应性)

IC (illumination compensation) 亮度补偿

ICC (International Color Consortium) 国际色彩联盟

ICM (iterated conditional mode) 条件迭代模式

ICT (irreversible color transform) 单向色彩变换

IDD-BM3D (iterative decouple deblurring-BM3D) 迭代解耦去模糊-BM3D

IDR (instantaneous decoding refresh) picture 即时解码刷新图像 (H.264 中的 IDR 帧)

Ill-posed problem 病态问题

Illumination 照度

Illumination compensation coding tool 照度补偿编码工具

Image rectification 图像校正

Image sharpening 图像锐化

Image smoothing 图像平滑

Inertial sensing motion tracking 惯性感应运动跟踪

Integer transforms 整数变换

Integer-valued vectors 整数值矢量

Integrated Multimedia Broadcasting (ISDB) standard 综合多媒体广播标准

Interlace lattice 交错栅格

Interlace video input 隔行视频输入

Interlaced scanning 隔行扫描

Interlaced video 隔行制式视频

Interpupilar distance 瞳距

Intersection of two lattices 两个点阵的交集

Intra-frame compression modes 帧内压缩模式

Irregular Repeat-Accumulate codes 不规则重复累积码

Isometry 刚体变换 (原意为等距，文中指只有旋转和平移，没有其他变换的仿射模型)

Isomorphic signals 同构信号

Kernel selection 核选择

K-means algorithm K-均值算法

K-nearest neighbor method K-近邻方法

Lambertian reflectance model 兰氏反射模型

Lattice(s) 点阵

Least-squares (LS) solution 最小二乘法

Lempel-Ziv coding LZ 编码

Lenticular sheets 透镜片

Lexicographic order 词典顺序

Linear contrast manipulation 线性对比度操作

Linear forward wavelet transform 线性前向小波变换

LMMSE (linear minimum mean-square error) filter 线性最小均方误差滤波器

Log-luminance domain 亮度对数域

LR (low-resolution) frames 低分辨率帧

LSB (least significant bit-plane) 最低有效比特位面

Mach band effect 马赫带效应 (对于不同亮度区域之间的边界区域，人类视觉系统会过高或过低估计其亮度值的现象)

Machine-learning methods 机器学习方法

MAD (minimum mean absolute difference) 最小平均绝对差

Markov random field (MRF) Markov 随机场

Marr-Hildreth scale space theory M-H 尺度空间理论

Masking 蒙版

Maximum-likelihood segmentation 最大似然分割

Maxshift method 最大平移法

MC (motion compensation) 运动补偿

MCP (motion-compensated prediction) 运动补偿预测

Mean square difference (MSE) 均方差

Mean-shift (MS) algorithm 均值漂移算法

Mean-square quantization errors 均方量化误差

ML (maximum likelihood) estimate 最大似然估计

MMCO (memory-management control-operation) 内存管理控制操作

Mosaic representation 马赛克表示 (指 photo mosaic，将多张窄视角照片拼贴成宽视角照片)

Mosquito noise 蚊式噪声

Motion trajectory 运动轨迹

MPC (maximum matching pel count) 最大匹配像素计数

MPEG (Moving Picture Experts Group) 活动图像专家组，也是图像压缩标准的名称

MRF (Markov random field) Markov 随机场

Multiple motions 复杂运动

Multi-resolution frame difference analysis 多分辨率帧间差分分析

Multi-resolution pyramid representations 多分辨率金字塔表示

Multi-scale representation 多尺度表示

Multi-view video coding (MVC) standard 多视点视频编码 (MVC) 标准

MVD (multi-view-video-plus-depth) format 多视点加深度格式

NLM (non-local means) filtering 非局部均值滤波

NSHP (Non-symmetric half-plane) support 非对称半平面支撑

Nyquist criterion 奈奎斯特准则

Optical flow 光流

Order-statistics filters 阶统计滤波器

Orthogonal filters 正交滤波器

Perfect-reconstruction (PR) property 完美重构特性

PES (packetized elementary streams) 可打包元素流

PEVQ (perceptual evaluation of video quality) 视频质量感知评估

PM&S (pattern matching and substitution) 模式匹配和替代

Polarizing filters 偏振滤波器

Polyphase implementation of decimation filters 
抽取 滤波 器 的 多 相 实 现 

Projections onto convex sets (POCS) formulation 
凸 集 投 影 公 式 

RADL (random access decodable leading) 随机接入可解码引导

RASL (random access skipped leading) 随机接入跳跃引导

Reciprocal lattice 倒易点阵

Recognition/example-based methods 基于识别/样例的方法

Recursive filters 递归滤波器

Recursively computable prediction model 可计算递归预测模型

Redundancy reduction 冗余缩减

Reversible color transform (RCT) 可逆彩色变换

Ringing artifacts 振铃效应

RLC (run-length coding) 行程编码

ROI (region-of-interest) coding 感兴趣区域编码

RTMP (Real-Time Messaging Protocol) 实时消息协议

RTP (Real-time Transport Protocol) 实时传输协议

RTSP (Real-Time Streaming Protocol) 实时流协议

Run mode 运行模式

SAD (sum of absolute differences) 绝对误差和

SEA (successive elimination algorithm) 逐次消元算法

scale-invariant feature transform (SIFT) 尺度不变特征变换 (一般直接写成 SIFT)

SKL (sequential Karhunen-Loeve) algorithm 序列 K-L 算法

Smearing, see Blur 模糊

SMV (super multi-view) displays 超多视角显示

Snakes (active-contour models) 蛇形模型 (主动轮廓模型)

SNR (signal-to-noise ratio) 信噪比

Sparse representations 稀疏表示

Sparse-correspondence estimation 稀疏相似估计

Statistical redundancy 统计冗余

Stein’s unbiased risk estimate (SURE) Stein 无偏代价估计

Sub-pixel motion estimation 亚像素运动估计

Super-resolution 超分辨率重建

SVC (scalable-video coding) 尺度可变视频编码

SVD (singular-value decomposition) 奇异值分解

Texture coding tools 纹理编码工具

TR (time-recursive) filters 时间递归滤波器

Ultra-high definition television (UHDTV) 超高清电视

Unsharp masking (USM) 虚光蒙版

Up-conversion 上转换

URQ (uniform reconstruction quantization) 均一化重构量化

VGA (Video Graphics Array) display standard 视频图形阵列显示标准

VLC (variable-length source coding) 变长源编码

Wavelet shrinkage 小波收缩

Wavelet transform coding 小波变换编码

Weak-perspective projection model 弱透视投影模型

XGA (Extended Graphics Array) 扩展图形阵列